JP4409526B2

JP4409526B2 - Optical flow processor

Info

Publication number: JP4409526B2
Application number: JP2006112395A
Authority: JP
Inventors: 雅彦吉本; 正幸深山; 亮山本; 祐貴福山; 孝行峯岸; 忠義片桐
Original assignee: 株式会社半導体理工学研究センター
Priority date: 2006-04-14
Filing date: 2006-04-14
Publication date: 2010-02-03
Anticipated expiration: 2026-04-14
Also published as: JP2007286827A

Description

The present invention relates to optical flow processing, which obtains a motion vector for each pixel in the image recognition processing of moving images.

Image recognition technology is a very important research subject because it can be applied to applications with high social value, such as vehicle safety systems, intelligent robot systems, and intelligent surveillance systems. Putting these vision systems into practical use requires real-time image recognition with high accuracy and high resolution. Optical flow processing, which obtains a high-definition motion vector for each pixel, is an elemental technology that can be applied widely across moving image recognition fields.

Optical flow is the set of motion vectors that represent how far each pixel in an image has moved. Using optical flow makes fine-grained recognition possible, so a high-definition moving image recognition system is considered feasible. However, deriving the optical flow for every pixel in real time requires several tens of GOPS of computation, so no real-time processing system for high-resolution images has been put into practical use. Software implementations currently either derive the flow only at feature points or use fast derivation that sacrifices accuracy. Real-time hardware, implemented with FPGAs, ASICs, and the like, has been limited to low resolution or low accuracy (see Non-Patent Documents 1 and 2).

Non-Patent Document 1: M.V. Correia and A.C. Campilho, ICPR, vol. 4, pp. 247-250, 2002
Non-Patent Document 2: J. Diaz, E. Ros, S. Mota, F. Pelay and E.M. Ortigosa, Early Cognitive Vision Workshop, Talk 21, 2004
Non-Patent Document 3: B.K.P. Horn and B.G. Schunck, "Determining optical Flow", AI vol.17, pp.185-204, 1981

An object of the present invention is to provide an optical flow processor that handles a high computational load in real time.

The optical flow processor according to the present invention obtains a motion vector for each pixel of an input moving image by iterating the operation expressed by the following equations:

$$u^{n+1} = \mathit{ave\_u}^{\,n} - E_x\,\frac{E_x\,\mathit{ave\_u}^{\,n} + E_y\,\mathit{ave\_v}^{\,n} + E_t}{\alpha^2 + E_x^2 + E_y^2}$$

$$v^{n+1} = \mathit{ave\_v}^{\,n} - E_y\,\frac{E_x\,\mathit{ave\_u}^{\,n} + E_y\,\mathit{ave\_v}^{\,n} + E_t}{\alpha^2 + E_x^2 + E_y^2}$$

(where u and v are the optical flows in the x and y directions, n + 1 is the iteration count, ave_u and ave_v are the average optical flows in the x and y directions, E_x, E_y, and E_t are the luminance gradients in the x, y, and t directions, and α is a weighting coefficient).

The optical flow processor includes a common arithmetic unit that receives the data necessary for the operation and performs the iterative operations for hierarchical image creation at a plurality of hierarchical levels with different resolutions, luminance gradient calculation, optical flow derivation and interpolation, and interpolated image creation. The common arithmetic unit consists of an adder that adds input data; first, second, third, and fourth processing arithmetic units that operate on the input data and/or data from the adder; and an accumulator that adds the operation results of the first to fourth processing arithmetic units. Each of the first to fourth processing arithmetic units consists of an averaging block that averages input data, a first product-sum block that performs a product-sum operation on input data, a second product-sum block that performs a product-sum operation on input data, a division/addition/subtraction block that divides, adds, and subtracts input data from the first product-sum block and/or the second product-sum block, a third product-sum block that performs a product-sum operation on input data from the division/addition/subtraction block and input data from the internal memory, and a fourth product-sum block that performs a product-sum operation on input data from the third product-sum block and input data from the internal memory.

According to the type of operation being executed in sequence, the optical flow processor changes the input data, selectively uses the averaging block, the first product-sum block, the second product-sum block, the division/addition/subtraction block, the third product-sum block, and the fourth product-sum block, and changes the data path. It thereby sequentially creates hierarchical images at the plurality of hierarchical levels; calculates the luminance gradients and stores them in a luminance gradient memory; derives the optical flow for the top hierarchical level and stores it in an optical flow memory; for hierarchical images of higher resolution, performs interpolation processing that converts the optical flow of the upper hierarchical level into the optical flow of the lower hierarchical level and stores it in the optical flow memory; and performs interpolated image creation, which applies motion compensation using the obtained optical flow. It then outputs the final optical flow.

Preferably, the optical flow processor further includes an input buffer that receives data from outside and from the output buffer, an internal memory that stores at least part of the operation results, and an output buffer that stores and outputs the operation results of the common arithmetic unit. The common arithmetic unit receives the data necessary for the operation from the input buffer and the internal memory. The internal memory includes, for example, a luminance gradient memory that stores luminance gradients and an optical flow memory that stores the optical flow.

In the optical flow processor, preferably, during optical flow derivation each of the first, second, third, and fourth processing arithmetic units receives, for one of four pixels processed in parallel, the luminance gradients E_x, E_y, E_t and the optical flow befor_u, befor_v of the previous frame. The averaging block calculates the average optical flows ave_u and ave_v; the first product-sum block calculates out_bel = E_x² + E_y² + α²; the second product-sum block calculates out_ber = E_x*ave_u + E_y*ave_v + E_t; the division/addition/subtraction block calculates div_add = out_ber/out_bel; the third product-sum block calculates u = ave_u - E_x*div_add and v = ave_v - E_y*div_add; and the fourth product-sum block calculates tmp = (befor_u-u)² + (befor_v-v)².

In the optical flow processor, preferably, during luminance gradient calculation the first, second, third, and fourth processing arithmetic units receive the luminance values E_mx of the previous, current, and next frames, the LPF filter coefficients lpf_x, and the diff filter coefficients diff_x (where x = l, m, n).

In the first processing arithmetic unit, the first product-sum block calculates out_bel = E_ml*lpf0 + E_mm*lpf1 + E_mn*lpf0, the second product-sum block calculates out_ber = E_ml*diff0 + E_mm*diff1 + E_mn*diff0, and the third product-sum block calculates Tmp_lpf = (E_l+E_n)*lpf0 + E_m*lpf1 in the y direction.

In the second processing arithmetic unit, the first product-sum block calculates out_bel = E_ml*lpf0 + E_mm*lpf1 + E_mn*lpf0, the second product-sum block calculates out_ber = E_ml*diff0 + E_mm*diff1 + E_mn*diff0, the division/addition/subtraction block outputs out_bel and out_ber, the third product-sum block receives out_bel and out_ber from the first, second, and third processing arithmetic units and calculates Tmp_lpf = (E_l+E_n)*lpf0 + E_m*lpf1 in the x direction, and the fourth product-sum block calculates diff_0 = diff0*(E_i-E_k) and diff_1 = diff0*(E_j-E_g).

In the third processing arithmetic unit, the first product-sum block calculates out_bel = E_ml*lpf0 + E_mm*lpf1 + E_mn*lpf0, the second product-sum block calculates out_ber = E_ml*diff0 + E_mm*diff1 + E_mn*diff0, the division/addition/subtraction block outputs out_bel and out_ber, the third product-sum block receives out_bel and out_ber from the first, second, and third processing arithmetic units and calculates Tmp_lpf = (E_l+E_n)*lpf0 + E_m*lpf1 in the x direction, and the fourth product-sum block receives the LPF coefficients and calculates tmp = (E_i+E_k)*lpf0 + (E_j+E_g)*lpf1.

In the fourth processing arithmetic unit, the third product-sum block calculates Tmp_lpf = (E_l+E_n)*lpf0 + E_m*lpf1 in the y direction.

In the optical flow processor, for example, a plurality of the common arithmetic units may be arranged in parallel. The two adjacent processing arithmetic units on either side of the boundary between neighboring common arithmetic units are wired so as to share input data, and the previous optical flow data is transferred from the other common arithmetic units to the common arithmetic unit at one end.

Optimizing the optical flow computation and sharing one arithmetic unit across the sequential operations, including luminance gradient calculation and optical flow derivation, achieves high-speed, high-accuracy computation with a small-scale circuit.

Hereinafter, embodiments of the invention are described with reference to the accompanying drawings.
FIG. 1 is a graph showing the performance of optical flow processors in terms of the number of pixels processed per unit time and the average angular error (MAE). MAE corresponds to the reciprocal of accuracy, so a lower MAE means higher accuracy. The conventional optical flow processors described in Non-Patent Documents 1 and 2 have the performance labeled Diaz and Miguel, respectively. The present invention realizes a real-time optical flow processor with higher accuracy and higher processing speed. The processing rates of 3M and 9M [pels/sec] marked in the figure correspond to CIF images at 30 fr/s and VGA images at 30 fr/s, respectively. By using the architecture described later, the optical flow processor of the present invention achieves the accuracy and speed shown in the figure and can handle a high computational load in real time. This VLSI architecture is scalable in accuracy and resolution. Exploiting this scalability by arranging four common arithmetic units side by side enables real-time processing of 30 fr/s VGA images. Whereas the conventional processors have an MAE of 10 degrees or more, the processor of the present invention can achieve an MAE below 10 degrees.
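The 3M and 9M [pels/sec] figures follow from simple arithmetic on frame size and frame rate. The sketch below checks them, assuming the standard CIF (352 x 288) and VGA (640 x 480) frame sizes, which the text does not spell out; the function name is illustrative only.

```python
# Frame sizes are the standard CIF and VGA dimensions (an assumption here);
# the 30 fr/s frame rate is the one quoted in the text.
FORMATS = {"CIF": (352, 288), "VGA": (640, 480)}


def pixels_per_second(width, height, frames_per_second=30):
    """Number of pixels that must be processed each second."""
    return width * height * frames_per_second


for name, (w, h) in FORMATS.items():
    rate = pixels_per_second(w, h)
    print(f"{name}: {rate / 1e6:.2f} Mpels/s")
```

Both results round to the 3M and 9M marks cited for FIG. 1.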

Various algorithms have been proposed for computing optical flow, the set of motion vectors that represent how far each pixel in an image has moved. Here, the highly accurate Horn & Schunck algorithm (Non-Patent Document 3) is used. It is a gradient method based on the global assumption that neighboring optical flows vary smoothly, and it obtains the optical flow by repeatedly solving the following update equations:

$$u^{n+1} = \mathit{ave\_u}^{\,n} - E_x\,\frac{E_x\,\mathit{ave\_u}^{\,n} + E_y\,\mathit{ave\_v}^{\,n} + E_t}{\alpha^2 + E_x^2 + E_y^2}$$

$$v^{n+1} = \mathit{ave\_v}^{\,n} - E_y\,\frac{E_x\,\mathit{ave\_u}^{\,n} + E_y\,\mathit{ave\_v}^{\,n} + E_t}{\alpha^2 + E_x^2 + E_y^2}$$

(Here, u and v are the optical flows in the x and y directions, n + 1 is the iteration count, ave_u and ave_v are the average optical flows in the x and y directions, E_x, E_y, and E_t are the luminance gradients in the x, y, and t directions, and α is a weighting coefficient.)
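As an illustration of the update rule, the following minimal sketch applies one Horn & Schunck iteration to small images stored as nested lists. The numerator E_x*ave_u + E_y*ave_v + E_t and denominator α² + E_x² + E_y² are the ones named in the description of the data path; the 4-neighbour average with border clamping and the function names are assumptions made for this example, since the text does not specify the averaging window.

```python
def local_average(f, y, x):
    """Average of the 4-connected neighbours, clamping at the border (assumed)."""
    h, w = len(f), len(f[0])
    up = f[max(y - 1, 0)][x]
    down = f[min(y + 1, h - 1)][x]
    left = f[y][max(x - 1, 0)]
    right = f[y][min(x + 1, w - 1)]
    return (up + down + left + right) / 4.0


def hs_iteration(u, v, ex, ey, et, alpha):
    """One pass of the update equations over every pixel; returns new (u, v)."""
    h, w = len(u), len(u[0])
    new_u = [[0.0] * w for _ in range(h)]
    new_v = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ave_u = local_average(u, y, x)
            ave_v = local_average(v, y, x)
            num = ex[y][x] * ave_u + ey[y][x] * ave_v + et[y][x]
            den = alpha ** 2 + ex[y][x] ** 2 + ey[y][x] ** 2
            new_u[y][x] = ave_u - ex[y][x] * num / den
            new_v[y][x] = ave_v - ey[y][x] * num / den
    return new_u, new_v


# Toy input: uniform gradients E_x = 1, E_y = 0, E_t = -1 describe a pattern
# shifting by one pixel per frame in x, so u should approach 1 and v stay 0.
size = 3
u = [[0.0] * size for _ in range(size)]
v = [[0.0] * size for _ in range(size)]
ex = [[1.0] * size for _ in range(size)]
ey = [[0.0] * size for _ in range(size)]
et = [[-1.0] * size for _ in range(size)]
for _ in range(50):
    u, v = hs_iteration(u, v, ex, ey, et, alpha=1.0)
print(u[1][1], v[1][1])
```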

For VLSI implementation, the HOE (Hierarchical Optical-Flow Estimation) algorithm, which builds on the highly accurate Horn & Schunck algorithm, is adopted. Its distinguishing features are a multidimensional gradient filter for luminance gradient calculation and hierarchical processing. The Horn & Schunck algorithm requires seven preceding and following pixels for LPF processing; here, a multidimensional gradient filter that uses three frames detects an accurate flow without greatly increasing the frame delay. Hierarchical processing handles the detection of large motions. It is an effective technique for deriving large flows accurately: hierarchical images are created with a Gaussian filter and 2:1 subsampling, and the flow is derived in order starting from the lowest-resolution image. Three hierarchical levels are used here. Because hierarchical processing reduces the number of iterations (itr) needed for the update equations to converge in most images, it also reduces the amount of computation.
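The hierarchical image creation described above (Gaussian filtering followed by 2:1 subsampling, three levels) can be sketched as follows. The 5-tap binomial kernel, the border clamping, and all function names are assumptions chosen for illustration; the text gives neither the Gaussian coefficients nor the border rule.

```python
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]  # assumed Gaussian approximation


def clamp(i, n):
    """Clamp index i into the valid range [0, n - 1]."""
    return max(0, min(i, n - 1))


def blur(img):
    """Separable 5x5 smoothing: filter rows, then columns, clamping at borders."""
    h, w = len(img), len(img[0])
    rows = [[sum(KERNEL[k] * img[y][clamp(x + k - 2, w)] for k in range(5))
             for x in range(w)] for y in range(h)]
    return [[sum(KERNEL[k] * rows[clamp(y + k - 2, h)][x] for k in range(5))
             for x in range(w)] for y in range(h)]


def downsample(img):
    """2:1 subsampling: keep every second pixel in both directions."""
    return [row[::2] for row in img[::2]]


def build_pyramid(img, levels=3):
    """Return [original, half resolution, quarter resolution, ...]."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(blur(pyramid[-1])))
    return pyramid


image = [[10.0] * 8 for _ in range(8)]
for level in build_pyramid(image):
    print(len(level), "x", len(level[0]))
```

Flow derivation would then start from the last (smallest) entry of the list and work back toward the original image.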

FIG. 2 shows a flowchart of the HOE algorithm. In the optical flow processing sequence, hierarchical images are first created by filtering and subsampling and stored in the hierarchical data buffer (S1). Next, image data (the original image or a hierarchical image) is input, and the luminance gradients are calculated and stored in the luminance gradient memory (S2). Here, the luminance is obtained with the multidimensional gradient filter as
E0*lpf0 + E1*lpf1 + E2*lpf0
where E0, E1, and E2 are luminance values and lpf0 and lpf1 are the LPF filter coefficients of the multidimensional gradient filter. The luminance gradients in the x, y, and t directions are derived from the multidimensional gradient equations with the luminance values of three frames as input. Next, a coarse optical flow ν2 is derived from the three frames (previous, current, and next) of the top hierarchical level (S3), and this flow is interpolated to convert it into the optical flow ν1 of the next lower hierarchical level (S5). This interpolation is bilinear interpolation with uniform weights: the optical flows of the four surrounding pixels are input, and their average is calculated and output. Next, the obtained optical flow ν1 is used to perform frame interpolation, that is, motion compensation, from the previous and next frames (S6). The interpolated image is obtained by the following equation.
Emage[x+y*Xsize] = (1.0-s)*(1.0-t)*SrcImage[X+Y*Xsize]
+ (1.0-s)*t*SrcImage[X+(Y+inc_y)*Xsize]
+ s*(1.0-t)*SrcImage[X+inc_x+Y*Xsize]
+ s*t*SrcImage[X+inc_x+(Y+inc_y)*Xsize]
Here, s and t are the fractional parts of the optical flow, SrcImage is the luminance value of the next or previous frame, and inc is the integer part of the optical flow. The optical flow and the frame luminance values are input, and the interpolated luminance values of the next or previous frame are calculated and output. Processing then moves to the next lower level (S7), and the above steps are repeated to derive the optical flow. Once the optical flow at the lowest level has been obtained, the results are added to give the final optical flow (S8). The process then returns to the first step S1 and handles the next frame.
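The interpolated-image equation is a standard bilinear blend of four neighbouring source pixels. The sketch below is one possible reading of it, not the patent's exact indexing: it assumes that (X, Y) is the target pixel displaced by the integer part of the flow, that the neighbour step is one pixel, and that accesses are clamped at the image border. The function name and these conventions are hypothetical.

```python
import math


def warp_pixel(src, xsize, ysize, x, y, u, v):
    """Sample src (flat, row-major) at (x + u, y + v) with bilinear weights."""
    fx, fy = math.floor(u), math.floor(v)   # integer parts of the flow
    s, t = u - fx, v - fy                   # fractional parts of the flow
    X = min(max(x + fx, 0), xsize - 1)      # base pixel, clamped (assumed)
    Y = min(max(y + fy, 0), ysize - 1)
    X1 = min(X + 1, xsize - 1)              # right neighbour, clamped
    Y1 = min(Y + 1, ysize - 1)              # lower neighbour, clamped
    return ((1.0 - s) * (1.0 - t) * src[X + Y * xsize]
            + (1.0 - s) * t * src[X + Y1 * xsize]
            + s * (1.0 - t) * src[X1 + Y * xsize]
            + s * t * src[X1 + Y1 * xsize])


# 4 x 3 test image whose luminance is x + 10 * y, stored row by row.
src = [float(x + 10 * y) for y in range(3) for x in range(4)]
print(warp_pixel(src, 4, 3, 1, 0, 0.5, 0.5))
```

Because the test image is linear in x and y, the sampled value equals the image function evaluated at the displaced position, which makes the weights easy to check by hand.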

To achieve high accuracy and high processing speed in the hardware implementation, the parameters of the HOE algorithm were optimized by simulation. Regarding the iteration count itr, which strongly affects the amount of computation, the update equations would ordinarily be iterated until the average update amount of the flow over the whole image per iteration falls to, for example, 10^-4 or less; to simplify hardware control, however, the iteration is cut off at a fixed count itr. The optimal iteration count was itr = 150, and the optimal weight parameter was α = 10.
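The difference between the two stopping rules can be sketched as follows. The helper names and the toy update step are hypothetical; only the 10^-4 threshold and the fixed count of 150 come from the text.

```python
def run_until_converged(step, state, threshold=1e-4, max_itr=10_000):
    """Software-style loop: stop once the mean update is at or below threshold."""
    for n in range(1, max_itr + 1):
        state, mean_update = step(state)
        if mean_update <= threshold:
            return state, n
    return state, max_itr


def run_fixed_iterations(step, state, itr=150):
    """Hardware-style loop: always run exactly itr iterations, no comparison."""
    for _ in range(itr):
        state, _ = step(state)
    return state


def toy_step(x):
    """Stand-in for one flow update: move halfway toward the fixed point 1.0."""
    new_x = x - (x - 1.0) / 2.0
    return new_x, abs(new_x - x)


converged, used = run_until_converged(toy_step, 0.0)
fixed = run_fixed_iterations(toy_step, 0.0)
print(used, converged, fixed)
```

The fixed-count loop needs only a counter, at the cost of running some iterations after the update has already become negligible.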

In addition, the HOE algorithm performs many calculations that require fractional precision, which makes the hardware cost very high. Low-load processing is therefore achieved by reducing the operation word length as far as possible without impairing flow detection accuracy. Various reduction patterns were defined, and bit-length optimization was simulated. The adopted pattern uses 16-bit and 24-bit fixed-point numbers with an adaptive allocation of bits between the integer and fractional parts, and it maintains about the same average angular error (MAE) and iteration count (itr) as floating point. In this way, the iteration count and operation word length of the HOE algorithm were optimized for VLSI.
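A minimal sketch of fixed-point quantization shows what choosing a word length and an integer/fraction split means in practice. The signed two's-complement range, round-to-nearest, and saturation on overflow are assumptions for illustration; the text does not state the rounding or overflow behaviour, nor the specific bit splits used.

```python
def to_fixed(value, word_bits, frac_bits):
    """Quantize to a signed fixed-point integer, saturating on overflow."""
    scaled = round(value * (1 << frac_bits))
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def to_float(fixed, frac_bits):
    """Convert a fixed-point integer back to a real value."""
    return fixed / (1 << frac_bits)


def quantize(value, word_bits, frac_bits):
    """Round-trip a value through the given fixed-point format."""
    return to_float(to_fixed(value, word_bits, frac_bits), frac_bits)


# The same value in a 16-bit word with 8 fractional bits and in a 24-bit word
# with 16 fractional bits (both splits are hypothetical examples).
for word, frac in [(16, 8), (24, 16)]:
    q = quantize(0.3, word, frac)
    print(word, frac, q, abs(q - 0.3))
```

Moving bits from the integer part to the fractional part lowers the rounding error but also lowers the largest representable magnitude, which is the trade-off an adaptive allocation has to balance per signal.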

In the HOE algorithm shown in FIG. 2, five processes are executed sequentially in real time: hierarchical image creation, luminance gradient calculation, optical flow derivation, interpolated image creation, and bilinear interpolation (see FIG. 3). If the workloads of these five processes were uniform, pipelined parallel processing could lower the operating frequency and reduce power consumption. However, optical flow derivation, which includes the iterative processing, accounts for most of the total computation, so an efficient pipeline cannot be built. Processes such as optical flow derivation, luminance gradient calculation, and hierarchical image creation are basically product-sum operations. Therefore, to reduce circuit scale in the hardware implementation of the HOE algorithm, a common arithmetic unit CE that performs all of the operations was developed. It provides a plurality of product-sum blocks and shares the arithmetic hardware by changing the data path for each process. To achieve real-time computation, the common arithmetic unit CE uses a dedicated 4-way SIMD data path circuit that can process four pixels simultaneously.

FIG. 4 shows the overall architecture of the optical flow processor using the common arithmetic unit CE. The optical flow processor is connected to an external memory (SDRAM) and a CPU. The external memory contains a memory for original image data, a memory for final optical flow data, and a memory for hierarchical data. Under control signals from the CPU, the data needed for each process is transferred from the external memory to the input buffers, namely the constant storage buffer and the luminance value storage buffer. The common arithmetic unit performs its operations on data from either input buffer or from the internal memory, supplied through a selector. The results are transferred to the output buffer or to the internal memory (the optical flow storage memory or the luminance gradient storage memory). The common arithmetic unit CE includes a sequence controller (not shown). Given the trade-off between memory bus bandwidth and built-in memory capacity, only the luminance gradient data and the optical flow data are stored in the built-in memory. Whether to use an internal memory, and which data to store in it, may be chosen as appropriate.

FIG. 5 shows the overall architecture including the internal blocks of the common arithmetic unit CE. The common arithmetic unit CE consists of a plurality of operation blocks (processing arithmetic units PE) that perform product-sum processing, a plurality of addition blocks (ADD) that add the input data to the operation blocks, and an accumulation block (ACC) that adds the operation results of the operation blocks (PE). To parallelize optical flow derivation, which has the largest amount of computation, four processing arithmetic units PE are arranged in parallel, each processing one pixel, giving a four-pixel-parallel SIMD configuration. The configuration also processes one pixel at a time with a throughput of 1 during the other processes.

FIG. 6 is an internal block diagram of a processing arithmetic unit PE in the common arithmetic unit CE. Almost all of the processes, including optical flow derivation, luminance gradient calculation, and hierarchical image creation, are basically product-sum processing. A plurality of product-sum blocks is therefore provided, and the arithmetic hardware is shared by changing the data path for each process. In the common arithmetic unit CE, control signals from the CPU cause the data needed for each operation to be transferred from the external memory to the buffers and also switch the data path. A single common arithmetic unit CE thus performs all of the processing needed for optical flow derivation.

The processing arithmetic unit PE consists of six operation elements: an averaging block (AVE) for averaging, addition, and shift processing; the product-sum blocks BEL and BER, which are the first and second product-sum filters; a division/addition/subtraction block (Div/ADD) that performs division, addition, and subtraction; the Calc/LPF block, which is the third product-sum filter; and the difference block (Diff), which is the fourth product-sum filter. In optical flow derivation, for example, external data is input on one path to the first product-sum filter (BEL block) and on the other path, via the averaging block (AVE), to the second product-sum filter (BER block). The Div/ADD block receives the results of both product-sum filters BEL and BER and divides them. The quotient is either output directly or processed by the Calc/LPF block, after which the difference block Diff takes the difference from the input data and outputs the result.

FIG. 7 shows the data path during optical flow derivation. Optical flow derivation uses all of the operation blocks in the processing arithmetic units PE, and the four processing arithmetic units PE process four pixels in parallel. First, optical flow data is transferred from the optical flow storage memory to the averaging block AVE, which obtains the local averages ave_u and ave_v. Next, the BEL and BER blocks calculate the denominator Ex² + Ey² + α² and the numerator ave_u*Ex + ave_v*Ey + Et of the optical flow update equations, and the division/addition/subtraction block (Div/ADD) divides them to obtain div = numerator/denominator. The Calc/LPF block then calculates u_{n+1} = ave_u - Ex*div and v_{n+1} = ave_v - Ey*div, updating the optical flow to u_{n+1}, v_{n+1}. The updated optical flow is written to the optical flow storage memory, which is an internal memory. The difference block (Diff) calculates the optical flow update amount (ave_u - u_{n+1})² + (ave_v - v_{n+1})². The accumulation block (ACC) adds the update amounts of the four pixels, and this sum is compared with a threshold. Table 1 below shows the processing performed by each block of the processing arithmetic unit PE during optical flow derivation.
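The block-by-block data path for optical flow derivation can be mirrored in software. The sketch below models one processing arithmetic unit PE as a function and the four parallel PEs plus the ACC block as a loop and a sum. The AVE block is left out for brevity, so ave_u and ave_v are passed in directly; the function names and the tuple layout are hypothetical.

```python
def pe_flow_update(ex, ey, et, ave_u, ave_v, alpha):
    """One PE: BEL, BER, Div/ADD, Calc/LPF, and Diff stages for a single pixel."""
    out_bel = ex * ex + ey * ey + alpha * alpha   # BEL: denominator
    out_ber = ex * ave_u + ey * ave_v + et        # BER: numerator
    div = out_ber / out_bel                       # Div/ADD: numerator / denominator
    u = ave_u - ex * div                          # Calc/LPF: updated flow
    v = ave_v - ey * div
    diff = (ave_u - u) ** 2 + (ave_v - v) ** 2    # Diff: update amount
    return u, v, diff


def ce_flow_update(pixels, alpha):
    """Four PEs in parallel (modelled as a loop) followed by the ACC block.

    Each entry of pixels is a tuple (ex, ey, et, ave_u, ave_v).
    """
    results = [pe_flow_update(*p, alpha) for p in pixels]
    acc = sum(r[2] for r in results)              # ACC: summed update amount
    flows = [(r[0], r[1]) for r in results]
    return flows, acc


pixels = [(1.0, 0.0, -1.0, 0.0, 0.0)] * 4
flows, acc = ce_flow_update(pixels, alpha=1.0)
print(flows, acc)
```

In the processor, the accumulated value is what gets compared with the threshold; here it is simply returned to the caller.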

FIG. 8 shows the data path for luminance gradient calculation. The computation blocks needed for luminance gradient calculation are selected and used for the processing. As shown in the figure, the four processing arithmetic units PE0, PE1, PE2, and PE3 perform different processes. PE0, PE1, and PE2 calculate the luminance gradients in the x, y, and t directions, respectively. In PE0 the Diff block does not operate, and in PE3 only the Calc_LPF block operates. First, the luminance values of the previous, current, and next frames are transferred from the external memory to the input buffer. The BEL and BER blocks then perform, in the t direction, LPF filtering (out_bel = E_ml * lpf0 + E_mm * lpf1 + E_mn * lpf0) and Diff filtering (out_ber = E_ml * diff0 + E_mm * diff1 + E_mn * diff0), respectively. Next, the Calc_LPF block performs LPF filtering in the x or y direction (Tmp_lpf = (E_l + E_n) * lpf0 + E_m * lpf1). Here, E_mx denotes a luminance value, lpfx denotes an LPF coefficient of the multidimensional gradient filter, diffx denotes a diff coefficient of the multidimensional gradient filter, and diff_0 and diff_1 denote the luminance gradients in the x and y directions. Finally, the Diff block performs Diff filtering in the x and y directions and LPF filtering in the y direction. In this way, the four processing arithmetic units PE together produce the luminance gradients of one pixel in the x, y, and t directions. The x-direction luminance gradient is obtained in PE0 by applying the LPF filter in the t direction, then the LPF filter in the y direction, then the Diff filter in the x direction. The y-direction luminance gradient is obtained in PE1 by applying the LPF filter in the t direction, then the LPF filter in the x direction, then the Diff filter in the y direction. The t-direction luminance gradient is obtained in PE2 by applying the Diff filter in the t direction, then the LPF filter in the x direction, then the LPF filter in the y direction.
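The three filter orders described above (one per gradient direction) can be sketched on a 3x3x3 neighbourhood. This is a hedged illustration only: the tap values LPF0, LPF1, and DIFF0 are hypothetical, the helper names are invented for this sketch, and the derivative is written in the antisymmetric form diff0 * (next − previous), which is an assumption about how the diff coefficients are applied.

```python
LPF0, LPF1 = 0.25, 0.5   # hypothetical smoothing taps (lpf0, lpf1, lpf0)
DIFF0 = 0.5              # hypothetical derivative tap


def lpf3(a, b, c):
    """3-tap low-pass filter: (a + c) * lpf0 + b * lpf1."""
    return (a + c) * LPF0 + b * LPF1


def diff3(a, b, c):
    """3-tap derivative, assumed antisymmetric: diff0 * (c - a)."""
    return DIFF0 * (c - a)


def reduce_t(cube, op):
    """Filter along t: cube[t][y][x] -> plane[y][x]."""
    return [[op(cube[0][y][x], cube[1][y][x], cube[2][y][x])
             for x in range(3)] for y in range(3)]


def reduce_y(plane, op):
    """Filter along y: plane[y][x] -> line indexed by x."""
    return [op(plane[0][x], plane[1][x], plane[2][x]) for x in range(3)]


def reduce_x(plane, op):
    """Filter along x: plane[y][x] -> line indexed by y."""
    return [op(*plane[y]) for y in range(3)]


def gradients(cube):
    """Return (Ex, Ey, Et) at the centre of a 3x3x3 luminance cube."""
    ex = diff3(*reduce_y(reduce_t(cube, lpf3), lpf3))   # t LPF -> y LPF -> x Diff
    ey = diff3(*reduce_x(reduce_t(cube, lpf3), lpf3))   # t LPF -> x LPF -> y Diff
    et = lpf3(*reduce_x(reduce_t(cube, diff3), lpf3))   # t Diff -> x LPF -> y LPF
    return ex, ey, et
```

With these example taps, a cube whose luminance rises by 1 per pixel in x yields Ex = 1 and zero gradients in the other two directions.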

Table 2 below shows the processing performed by each block of the processing arithmetic unit PE during luminance gradient calculation. The digits 0, 1, 2, and 3 after a block name indicate that the block belongs to processing arithmetic unit PE0, PE1, PE2, or PE3, respectively.

FIG. 9 shows the data path for hierarchical image creation. One pixel of the hierarchical image is created by product-sum processing of the luminance data of 25 pixels in three processing arithmetic units. First, the luminance data is transferred from the external memory to the input buffer. The addition blocks ADD1 and ADD2 then obtain the sums img_x of luminance values of the original image, and five product-sum blocks BER and BEL in the 0th, 1st, and 2nd processing arithmetic units each perform the Gaussian filter multiplications for one row: out_bel = img_a1 * A + img_a2 * B + img_a3 * C, out_bel = img_a1 * C + img_a2 * D + img_a3 * F, ..., out_bel = img_e1 * B + img_e2 * D + img_e3 * F. Here, A, B, C, D, E, and F are the coefficients of the Gaussian filter. The division/addition/subtraction block Div_Add and the accumulation block ACC then add the results for the five rows and output the luminance value of one pixel of the upper-layer image, img = img_a1 * A + img_a2 * B + img_a3 * C + img_b1 * B + img_b2 * C + img_b3 * D + img_c1 * C + img_c2 * D + img_c3 * E + img_d1 * B + img_d2 * D + img_d3 * E + img_e1 * B + img_e2 * D + img_e3 * F.
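The row-wise product-sum followed by accumulation over five rows can be sketched as below. The numeric values of the coefficients A to F are not given in the text, so this sketch assumes a 5x5 kernel built from the binomial taps 1, 4, 6, 4, 1; the kernel, the integer normalisation, and the function names are all illustrative assumptions.

```python
TAPS = [1, 4, 6, 4, 1]                            # hypothetical taps, not the patent's A-F
KERNEL = [[a * b for b in TAPS] for a in TAPS]    # 5x5 weights, total 256


def row_mac(pixels, weights):
    """Product-sum over one row of five pixels (the role of a BEL or BER block)."""
    return sum(p * w for p, w in zip(pixels, weights))


def upper_layer_pixel(window):
    """window: 5x5 list of luminance values -> one pixel of the upper layer."""
    acc = 0
    for row, weights in zip(window, KERNEL):
        acc += row_mac(row, weights)              # add the five row results
    return acc // 256                             # normalise by the kernel total
```

Because a Gaussian kernel is symmetric, pixels that share a coefficient can be added before multiplying, which is one plausible reading of why ADD1 and ADD2 run first and each row needs only three multiplications.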

FIG. 10 shows the data path for bilinear interpolation. Optical flow data is transferred to the averaging block AVE, where addition and shift operations are performed. The result is then written back to the optical flow storage memory.
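Bilinear interpolation at half-pixel positions needs only additions and shifts, which is consistent with running it on the averaging block. A minimal sketch, assuming integer (fixed-point) flow components and a doubling of resolution; the output size and the function name are assumptions, and any rescaling of the flow magnitude between layers is outside this sketch.

```python
def upsample2(coarse):
    """Double the resolution of a 2-D integer grid using only adds and shifts."""
    h, w = len(coarse), len(coarse[0])
    fine = [[0] * (2 * w - 1) for _ in range(2 * h - 1)]
    for y in range(2 * h - 1):
        for x in range(2 * w - 1):
            y0, x0 = y >> 1, x >> 1
            # On even coordinates the neighbour index equals the base index,
            # so the four-term sum degenerates to 4x or 2x+2x the source values.
            y1, x1 = y0 + (y & 1), x0 + (x & 1)
            total = (coarse[y0][x0] + coarse[y0][x1]
                     + coarse[y1][x0] + coarse[y1][x1])
            fine[y][x] = total >> 2               # divide by 4 with a shift
    return fine
```

For example, `upsample2([[0, 4], [8, 12]])` gives `[[0, 2, 4], [4, 6, 8], [8, 10, 12]]`.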

FIG. 11 shows the data path for vector addition. First, the optical flow data interpolated at the upper layer is transferred from the external memory to the input buffer. The averaging block AVE then adds the optical flow data in the internal memory and the optical flow data in the input buffer. The result is stored in the optical flow memory.

FIG. 12 shows the data path for interpolated image creation. During interpolated image creation, the luminance gradient memory holds the luminance data of the image. First, the integer part of the optical flow is transferred from the optical flow storage memory to the luminance gradient memory, and the luminance data of the original image is transferred to the Calc_LPF and Diff blocks. At the same time, the fractional part of the optical flow is transferred to the product-sum block BER, where a multiplication determines the coefficient st; the division/addition/subtraction block Div_Add then determines the four coefficients st, s(1-t), t(1-s), and (1-s)(1-t). The Calc_LPF and Diff blocks multiply the coefficients obtained by Div_Add by the luminance values transferred from the luminance gradient memory to obtain Tmplpf = E_c * t(1-s) + E_d * (1-s)(1-t) and tmp = E_a * st + E_b * s(1-t). Finally, the accumulation block ACC accumulates the two values, and the result is transferred to the output buffer.
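The split into an integer part (used as a memory address) and a fractional part (used as weights) can be sketched as follows. The assignment of E_a to E_d to particular neighbours is an assumption chosen so that the weights match the formulas above, the function name is hypothetical, and no boundary handling is included, so the caller must keep the sampled position inside the image.

```python
import math


def interpolate_pixel(image, x, y, u, v):
    """Sample image[row][col] at (x + u, y + v) with bilinear weights."""
    fx, fy = x + u, y + v
    x0, y0 = math.floor(fx), math.floor(fy)       # integer part selects the address
    s, t = fx - x0, fy - y0                       # fractional part gives the weights
    # Assumed layout: E_d top-left, E_b top-right, E_c bottom-left, E_a bottom-right.
    e_d, e_b = image[y0][x0], image[y0][x0 + 1]
    e_c, e_a = image[y0 + 1][x0], image[y0 + 1][x0 + 1]
    tmp = e_a * s * t + e_b * s * (1 - t)
    tmp_lpf = e_c * t * (1 - s) + e_d * (1 - s) * (1 - t)
    return tmp + tmp_lpf                          # final accumulation (ACC)
```

The four weights always sum to 1, so a constant image is reproduced exactly regardless of the flow.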

FIGS. 13 to 21 show the arithmetic circuits of the AVE, BER, BEL, Calc_LPF, Div_add, Diff, ADD1, ADD2, and ACC computation blocks in the common arithmetic unit CE, respectively. Tables 3 to 11 show the signals handled by these computation blocks.

Table 3 shows the signals handled by the AVE computation block.

Table 4 shows the signals handled by the BER computation block.

Table 5 shows the signals handled by the BEL computation block.

Table 6 shows the signals handled by the Calc_LPF computation block.

Table 7 shows the signals handled by the Div_add computation block.

Table 8 shows the signals handled by the Diff computation block.

Table 9 shows the signals handled by the ADD1 computation block.

Table 10 shows the signals handled by the ADD2 computation block.

Table 11 shows the signals handled by the ACC computation block.

FIGS. 22 and 23 show, for the BER computation block, which is one of the internal elements of the processing arithmetic unit PE, the data path during optical flow derivation and the data path during luminance gradient calculation, respectively. The arrows indicate the direction of the data path through the multiplexers. Although not illustrated, the other computation elements likewise switch their input data and data paths according to the processing being performed. By changing the data paths with multiplexers under control signals from the CPU and using the internal elements selectively, the common arithmetic unit can handle every type of operation.
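The idea of one product-sum element serving two operations through multiplexed inputs can be sketched in software. This is only an analogy: the circuits of FIGS. 22 and 23 are not reproduced here, and the `Mode` enum, the `mux` helper, and the three-term structure are assumptions made for illustration. The two formulas are those given for the BER block in optical flow derivation and in luminance gradient calculation.

```python
from enum import Enum


class Mode(Enum):
    OPTICAL_FLOW = 0
    LUMINANCE_GRADIENT = 1


def mux(select, inputs):
    """Software stand-in for a multiplexer driven by a control signal."""
    return inputs[select.value]


def ber_block(mode, flow_in, grad_in):
    """flow_in = (ave_u, ave_v, ex, ey, et); grad_in = (e_ml, e_mm, e_mn, diff0, diff1)."""
    ave_u, ave_v, ex, ey, et = flow_in
    e_ml, e_mm, e_mn, diff0, diff1 = grad_in
    # The same three multiply-add lanes receive different operands per mode.
    a = mux(mode, [(ave_u, ave_v, et), (e_ml, e_mm, e_mn)])
    b = mux(mode, [(ex, ey, 1), (diff0, diff1, diff0)])
    return sum(p * q for p, q in zip(a, b))
```

In optical flow mode the result is ave_u * Ex + ave_v * Ey + Et; in gradient mode it is E_ml * diff0 + E_mm * diff1 + E_mn * diff0.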

Because its dedicated data bus circuits (parallelization and pipelining) enable efficient real-time processing, the optical flow processor described above can process CIF at 30 fps with high accuracy (average MAE over all test images = 5.2 degrees) at 150 iterations per layer, as shown in FIG. 1. That is, for sequences of CIF at 30 fps or larger, it can extract a high-accuracy optical flow, with MAE (the accuracy index) below 10 degrees, in real time.
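The text does not define how MAE is computed. A common convention in the optical flow literature measures the angle between the space-time vectors (u, v, 1) of the estimated and reference flows and averages it over pixels; the sketch below assumes that convention, which may differ from the one used for the figure of 5.2 degrees.

```python
import math


def angular_error_deg(u, v, u_ref, v_ref):
    """Angle in degrees between (u, v, 1) and (u_ref, v_ref, 1)."""
    dot = u * u_ref + v * v_ref + 1.0
    norm = (math.sqrt(u * u + v * v + 1.0)
            * math.sqrt(u_ref * u_ref + v_ref * v_ref + 1.0))
    cos_angle = max(-1.0, min(1.0, dot / norm))   # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def mean_angular_error(flow, reference):
    """flow, reference: equal-length lists of (u, v) pairs."""
    errors = [angular_error_deg(u, v, ur, vr)
              for (u, v), (ur, vr) in zip(flow, reference)]
    return sum(errors) / len(errors)
```

Under this definition, an estimate of (1, 0) against a reference of (0, 0) has an error of 45 degrees.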

The optical flow processor described above realizes a scalable architecture that supports high accuracy and high resolution. For applications that require higher accuracy or higher resolution, it suffices to arrange several of the common arithmetic units described above side by side. The wiring is changed so that the common arithmetic units CE arranged in parallel can share data. For example, the two adjacent processing arithmetic units PE located at the boundary between adjacent common arithmetic units are wired to share input data, and the previous optical flow data is transferred from another common arithmetic unit to the common arithmetic unit at one end of the plurality of common arithmetic units. This allows many applications to be supported. Because the architecture is scalable in both accuracy and resolution, it can serve as the motion feature extraction means in many moving-image recognition systems.

For example, when two optical flow processors are arranged side by side, the number of iterations in the optical flow derivation unit can be doubled to achieve higher accuracy. FIG. 24 shows the pixels of the frame processed by each processor in this configuration. Eight pixels are processed in parallel, so the iteration count can be doubled and a more accurate optical flow obtained. FIG. 25 shows the architecture with two optical flow processors. To obtain the optical flow of four pixels, one common arithmetic unit CE needs six pixels in total: the four pixels plus one pixel on each of the left and right. Adjacent common arithmetic units, here the first common arithmetic unit CE1 and the zeroth common arithmetic unit CE0, must therefore share data, and the data wiring is changed for this purpose. Specifically, CE0 and CE1 each transfer to the other the one column that handles the adjacent pixels, and at the same time, for processing of the next columns, two columns of optical flow data must be transferred from CE1 to the internal memory for CE0.
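The sharing requirement follows from simple set arithmetic: each unit owns four columns but needs one extra column on each side. A small sketch, with a hypothetical column numbering in which CE0 owns columns 0 to 3 and CE1 owns columns 4 to 7:

```python
def owned_columns(unit, width=4):
    """Columns whose optical flow this common arithmetic unit computes."""
    start = unit * width
    return set(range(start, start + width))


def needed_columns(unit, width=4):
    """Owned columns plus one neighbour column on the left and on the right."""
    start = unit * width
    return set(range(start - 1, start + width + 1))


def columns_from_neighbour(unit, neighbour, width=4):
    """Columns that `unit` needs but `neighbour` owns."""
    return sorted(needed_columns(unit, width) & owned_columns(neighbour, width))
```

Here `columns_from_neighbour(0, 1)` is `[4]` and `columns_from_neighbour(1, 0)` is `[3]`, matching the statement that each unit passes one column to the other.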

Arranging optical flow processors in parallel also makes it possible to handle larger image sizes. FIG. 26 shows the pixels of the processed frame when an NTSC-size image is processed four-way in parallel. Because the frame to be processed is divided in the horizontal direction, the sizes of the optical flow memory and the luminance gradient memory in each common arithmetic unit CE do not change even when the number of pixels in the vertical direction changes. A larger image size can therefore be handled simply by connecting more optical flow processors. The common arithmetic units are connected as in FIG. 24, with four arranged side by side, and the optical flow data of the rightmost two columns, which belong to the third common arithmetic unit CE3, is transferred to the memory that stores the previous flow data of the zeroth common arithmetic unit CE0. For example, with four common arithmetic units CE, VGA images at 30 fps can be processed at 150 iterations per layer.
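The figure of four units for VGA at 30 fps can be checked with a rough pixel-rate estimate. The sketch assumes the standard frame sizes (CIF 352x288, VGA 640x480), that one common arithmetic unit sustains exactly the CIF 30 fps rate at 150 iterations per layer, and that throughput scales linearly with the number of units; overheads such as inter-unit transfers are ignored.

```python
def pixel_rate(width, height, fps):
    """Pixels per second for a given frame size and frame rate."""
    return width * height * fps


def units_needed(target_rate, rate_per_unit):
    """Smallest whole number of units whose combined rate covers the target."""
    return -(-target_rate // rate_per_unit)       # ceiling division on integers


CIF30 = pixel_rate(352, 288, 30)                  # 3,041,280 pixels/s
VGA30 = pixel_rate(640, 480, 30)                  # 9,216,000 pixels/s
```

VGA at 30 fps is about 3.03 times the CIF 30 fps rate, so `units_needed(VGA30, CIF30)` evaluates to 4 under these assumptions.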

FIG. 1 is a graph showing the performance of the optical flow processor in terms of the number of pixels processed per unit time and the mean angle error (MAE).
FIG. 2 is a flowchart of the HOE algorithm.
FIG. 3 is a diagram showing the five processes in the HOE algorithm.
FIG. 4 is a diagram showing the overall architecture of the optical flow processor.
FIG. 5 is a diagram showing the overall architecture including the internal blocks of the common arithmetic unit.
FIG. 6 is an internal block diagram of a processing arithmetic unit in the common arithmetic unit.
FIG. 7 is a diagram showing the data path for optical flow derivation.
FIG. 8 is a diagram showing the data path for luminance gradient calculation.
FIG. 9 is a diagram showing the data path for hierarchical image creation.
FIG. 10 is a diagram showing the data path for bilinear interpolation.
FIG. 11 is a diagram showing the data path for vector addition.
FIG. 12 is a diagram showing the data path for interpolated image creation.
FIG. 13 is an arithmetic circuit diagram of the AVE computation block.
FIG. 14 is an arithmetic circuit diagram of the BER computation block.
FIG. 15 is an arithmetic circuit diagram of the BEL computation block.
FIG. 16 is an arithmetic circuit diagram of the Calc_LPF computation block.
FIG. 17 is an arithmetic circuit diagram of the Div_add computation block.
FIG. 18 is an arithmetic circuit diagram of the Diff computation block.
FIG. 19 is an arithmetic circuit diagram of the ADD1 computation block.
FIG. 20 is an arithmetic circuit diagram of the ADD2 computation block.
FIG. 21 is an arithmetic circuit diagram of the ACC computation block.
FIG. 22 is a diagram showing the data path of the BER computation block during optical flow derivation.
FIG. 23 is a diagram showing the data path of the BER computation block during luminance gradient calculation.
FIG. 24 is a diagram of the frame pixels processed by each common arithmetic unit.
FIG. 25 is a diagram for explaining the scalable architecture.
FIG. 26 is a diagram of the processed frame pixels when an NTSC-size image is processed four-way in parallel.

Explanation of symbols

PE: processing arithmetic unit; ADD1, ADD2: addition blocks; ACC: accumulation block; AVE: averaging block; BEL: first product-sum block; BER: second product-sum block; Div/ADD: division/addition/subtraction block; Calc/LPF: third product-sum block; Diff: fourth product-sum block.

Claims

An optical flow processor that, for an input moving image, obtains a motion vector for each pixel by repeating the operation expressed by the following formulas

u_{n+1} = ave_u − E_x * (E_x * ave_u + E_y * ave_v + E_t) / (E_x² + E_y² + α²)
v_{n+1} = ave_v − E_y * (E_x * ave_u + E_y * ave_v + E_t) / (E_x² + E_y² + α²)

(where u and v are the optical flows in the x and y directions, n + 1 is the number of repetitions, ave_u and ave_v are the average optical flows in the x and y directions, E_x, E_y, and E_t are the luminance gradients in the x, y, and t directions, and α is a weighting coefficient), the optical flow processor comprising
a common arithmetic unit that receives the data needed for computation and performs the repeated operations for hierarchical image creation at a plurality of hierarchical levels with different resolutions, luminance gradient calculation, optical flow derivation and interpolation, and interpolated image creation,
wherein the common arithmetic unit consists of an adder that performs addition on input data, first, second, third, and fourth processing arithmetic units that operate on the input data and/or data from the adder, and an accumulator that adds the operation results of the first to fourth processing arithmetic units,
each of the first to fourth processing arithmetic units comprises:
an averaging block that averages the input data;
a first product-sum block that performs a product-sum operation on input data;
a second product-sum block that performs a product-sum operation on input data;
a division/addition/subtraction block that performs division, addition, and subtraction on input data from the first product-sum block and/or the second product-sum block;
a third product-sum block that performs a product-sum operation on the input data from the division/addition/subtraction block and the input data from the internal memory; and
a fourth product-sum block that performs a product-sum operation on the input data from the third product-sum block and the input data from the internal memory,
and wherein the input data is changed according to the type of operation executed in sequence, the averaging block, the first product-sum block, the second product-sum block, the division/addition/subtraction block, the third product-sum block, and the fourth product-sum block are used selectively, and the data path is changed, so that the following are executed in sequence and the final optical flow is output: creation of hierarchical images at a plurality of hierarchical levels; calculation of the luminance gradient and its storage in the luminance gradient memory; derivation of the optical flow for the highest hierarchical level and its storage in the optical flow memory; interpolation processing that, for a higher-resolution hierarchical image, converts the optical flow of the upper level into the optical flow of the lower level and stores it in the optical flow memory; and interpolated image creation that performs motion compensation using the optical flow.

The optical flow processor according to claim 1, further comprising an input buffer that receives data from the outside and from the output buffer, an internal memory that stores at least part of the operation results, and an output buffer that stores and outputs the operation results of the common arithmetic unit,
wherein the common arithmetic unit receives the data needed for computation from the input buffer and the internal memory.

The optical flow processor according to claim 2, wherein the internal memory comprises a luminance gradient memory that stores the luminance gradient and an optical flow memory that stores the optical flow.

The optical flow processor according to claim 1, wherein, when the optical flow is derived, each of the first, second, third, and fourth processing arithmetic units receives, for one of four pixels processed in parallel, the luminance gradients E_x, E_y, E_t and the previous optical flow befor_u, befor_v,
the averaging block calculates the average optical flows ave_u and ave_v,
the first product-sum block calculates out_bel = E_x² + E_y² + α²,
the second product-sum block calculates out_ber = E_x * ave_u + E_y * ave_v + E_t,
the division/addition/subtraction block calculates div_add = out_ber / out_bel,
the third product-sum block calculates u = ave_u − E_x * div_add and v = ave_v − E_y * div_add, and
the fourth product-sum block calculates tmp = (befor_u − u)² + (befor_v − v)².

The optical flow processor according to claim 1, wherein, when the luminance gradient is calculated, the first, second, third, and fourth processing arithmetic units receive the luminance values E_mx of the previous frame, the current frame, and the next frame, the LPF filter coefficients lpf_x, and the diff filter coefficients diff_x (where x = l, m, n),
In the first processing arithmetic unit,
The first product-sum block calculates out_bel = E_ml * lpf0 + E_mm * lpf1 + E_mn * lpf0,
The second product-sum block calculates out_ber = E_ml * diff0 + E_mm * diff1 + E_mn * diff0,
The third sum-of-products block calculates Tmp_lpf = (E_l + E_n) * lpf0 + E_m * lpf1 in the y direction,
In the second processing arithmetic unit,
The first product-sum block calculates out_bel = E_ml * lpf0 + E_mm * lpf1 + E_mn * lpf0,
The second product-sum block calculates out_ber = E_ml * diff0 + E_mm * diff1 + E_mn * diff0,
The division / addition / subtraction block outputs out_bel and out_ber,
The third product-sum block receives out_bel and out_ber from the first, second, and third processing arithmetic units and calculates Tmp_lpf = (E_l + E_n) * lpf0 + E_m * lpf1 in the x direction,
The fourth product-sum block calculates diff_0 = diff0 * (E_i-E_k) and diff_1 = diff0 * (E_j-E_g),
In the third processing arithmetic unit,
The first product-sum block calculates out_bel = E_ml * lpf0 + E_mm * lpf1 + E_mn * lpf0,
The second product-sum block calculates out_ber = E_ml * diff0 + E_mm * diff1 + E_mn * diff0,
The division / addition / subtraction block outputs out_bel and out_ber,
The third product-sum block receives out_bel and out_ber from the first, second, and third processing arithmetic units and calculates Tmp_lpf = (E_l + E_n) * lpf0 + E_m * lpf1 in the x direction,
The fourth product-sum block receives the LPF coefficients and calculates tmp = (E_i + E_k) * lpf0 + (E_j + E_g) * lpf1,
In the fourth processing arithmetic unit,
The third product-sum block calculates Tmp_lpf = (E_l + E_n) * lpf0 + E_m * lpf1 in the y direction.

The optical flow processor according to any one of claims 1 to 5, wherein
a plurality of the common arithmetic units are arranged in parallel,
two adjacent processing arithmetic units located between adjacent common arithmetic units are wired to share input data, and
the previous optical flow data is transferred from another common arithmetic unit to the common arithmetic unit at one end of the plurality of common arithmetic units.