JP4929474B2 - Image processing device

Info

Publication number: JP4929474B2
Application number: JP2006263627A
Authority: JP
Inventors: 学中尾
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2006-09-27
Filing date: 2006-09-27
Publication date: 2012-05-09
Anticipated expiration: 2026-09-27
Also published as: JP2008084034A

Abstract

PROBLEM TO BE SOLVED: To provide an image processing apparatus for image correlation operations that processes correlation operations and spatial filters at high speed and executes a plurality of small image operations at high speed, either simultaneously or in combination. SOLUTION: The image processing apparatus is provided with: a shared image input bus that transfers data for a plurality of pixels from a memory in one cycle; a means for selecting, from the multi-pixel data transferred on the shared image input bus, data of specified horizontal and vertical sizes starting from specified horizontal and vertical start points; an input buffer for storing the selected data; a systolic array that takes data from the input buffer and performs operations on it; a synchronization memory for synchronizing the results computed by the systolic arrays; and a means for pipeline-processing operations on, or peak extraction from, the plurality of synchronized operation results. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to an image processing apparatus that performs correlation operations on images.

To cope with a constantly changing environment, a mobile robot must process images from its camera in real time to understand its surroundings. The amount of image computation is enormous, so the frequently used correlation, spatial filter, and feature extraction operations must be processed at high speed.

Conventionally, there is a technique that tracks an object by computing correlation values between a captured image and fixed reference image data for the object to be tracked, and then detecting the peak position. In this technique, the correlation between search image data and reference image data of m × n pixels is computed a plurality of times while shifting the position of the reference image relative to the search image, and the resulting correlation values are accumulated. This is equivalent to outputting, at high speed, the correlation value between the search image data and reference image data of am × bn pixels (Patent Document 1).

Next, with reference to FIGS. 14 and 15, a brief description is given of a technique that speeds up the correlation operations and the like of the former technique by using an LSI in which PEs (processor elements) are arranged in a two-dimensional systolic array and perform parallel operations in a pipelined manner.

FIG. 14(a) shows the overall configuration of a conventional image processing apparatus. The video input/output circuit controls video input and output, and the memory control circuit controls access to the external memory and stores the video input image in it. When a command is given from outside, image data is transferred from the external memory to the parallel arithmetic circuit in response to a request from the operation control circuit. As shown in FIG. 14(b), the parallel arithmetic circuit consists of PEs arranged in a two-dimensional array. Image data is input to the lower-left PEn1 and flows through the array in a pipelined manner, PEn1 → PEn2 → PEn3 and so on, and the circuit outputs the result of the SAD (sum of absolute differences) correlation operation on the images.
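As an illustration of the SAD correlation that the parallel arithmetic circuit computes, the following is a minimal software sketch. It assumes the usual definition of SAD (the sum of absolute pixel differences for each placement of the reference image inside the search image); the function names `sad_map` and `best_match` are hypothetical, and the notation of the patent's own equation may differ.

```python
def sad_map(search, ref):
    """Return the SAD value for every placement of ref inside search.

    search and ref are lists of rows of integer pixel values.
    """
    sh, sw = len(search), len(search[0])
    rh, rw = len(ref), len(ref[0])
    result = []
    for v in range(sh - rh + 1):          # vertical displacement
        row = []
        for u in range(sw - rw + 1):      # horizontal displacement
            total = 0
            for y in range(rh):
                for x in range(rw):
                    total += abs(search[v + y][u + x] - ref[y][x])
            row.append(total)
        result.append(row)
    return result


def best_match(sad):
    """Return (v, u) of the smallest SAD value, taken here as the peak."""
    return min(
        ((v, u) for v in range(len(sad)) for u in range(len(sad[0]))),
        key=lambda p: sad[p[0]][p[1]],
    )
```

A hardware systolic array evaluates the inner sums in parallel across its PEs; the nested loops here only show which quantity is being computed.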

In this configuration, a plurality of PEs operate in parallel, so correlation operations can be processed at high speed.

Patent Document 1: JP-A-10-040392

For real-time understanding of the surroundings, as in autonomous robot movement, the performance of the former technique is insufficient, and even the latter conventional technique cannot be said to perform well enough. Further performance improvement is required.

Recent miniaturization of semiconductor processes allows more circuits to be integrated in an LSI, so one conceivable way to improve performance is to implement more PEs.

However, even if the number of elements in the conventional array configuration described above is simply quadrupled, for example from 8×8 to 16×16, the computational performance does not simply quadruple when a correlation operation is executed on a large 16×16 reference area.

For example, FIG. 14(e) shows the timing chart for a correlation operation with the 8×8 reference image of FIG. 14(c) and the 16×16 search range of FIG. 14(d) on an 8×8 array. As FIG. 14(e) shows, this operation takes 64 cycles to load the reference image, 529 cycles to load the search image, and 10 cycles to output the result, for a total of 603 cycles. With a 16×16 reference image, this operation must be performed four times, giving a processing time of 2412 cycles.

In contrast, FIG. 15(c) shows the timing chart for a correlation operation with the 16×16 reference image of FIG. 15(a) and a 16×16 search range on a 16×16 array. As FIG. 15(c) shows, this operation takes 256 cycles to load the reference image, 961 cycles to load the search image, and 10 cycles to output the result, for a total of 1227 cycles. Even with four times as many PEs, the performance improves only by a factor of about two.
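The cycle counts quoted for the two conventional configurations can be reproduced with a small model. This is a sketch under stated assumptions: the input port delivers one pixel per cycle, and the search image side equals the reference side plus the search range minus one (an inference from the figures 529 = 23 × 23 and 961 = 31 × 31). The function name `single_port_cycles` is hypothetical.

```python
def single_port_cycles(ref_side, search_range, output_cycles=10):
    """Cycles for one correlation pass with a one-pixel-per-cycle input port."""
    search_side = ref_side + search_range - 1   # e.g. 8 + 16 - 1 = 23
    ref_load = ref_side * ref_side              # reference image load
    search_load = search_side * search_side     # search image load
    return ref_load + search_load + output_cycles


small = single_port_cycles(8, 16)     # 64 + 529 + 10
four_passes = 4 * small               # 16x16 reference on the 8x8 array
large = single_port_cycles(16, 16)    # 256 + 961 + 10
speedup = four_passes / large         # gain from quadrupling the PE count
```

In this model the load terms dominate the total, which is why quadrupling the number of PEs roughly halves the cycle count instead of quartering it.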

The reason is that a systolic array loads and processes data one pixel at a time from its input port, so the image data load time determines the processing time. Performance therefore does not improve in proportion to the number of PEs, and the utilization of each PE needs to be raised.

In addition, image recognition often uses several operations on small images. In feature extraction, for example, two 7×7 spatial filters are frequently used to generate luminance gradient images, whose covariance matrix is then computed. A method is needed that can execute several such small image operations simultaneously and cascade their results.

To solve these problems, the present invention provides hardware that processes image correlation operations and spatial filters at high speed and, further, executes a plurality of small image operations at high speed, either simultaneously or in combination.

Claim 1 provides: a shared image input bus that can transfer data for a plurality of pixels from the image memory in one cycle; a means for selecting, from the input data to each systolic array, data of a specified horizontal/vertical size starting from a specified horizontal/vertical start point; a means for synchronizing the operation results of the systolic arrays; and a means for pipeline-processing operations between, or peak extraction from, the synchronized operation results. Transmitting data for a plurality of pixels on the shared image input bus and letting each systolic array take the data it needs into its input buffer shortens the data waiting time. Furthermore, because each systolic array reads data from its input buffer and processes it at its own timing, the means for synchronizing the results makes it possible to align operation results that emerge at different times and operate on them together.

Claim 2 provides a means for selecting the image input data to each systolic array from the data on the shared image input bus and the operation result data of a systolic array, and a divider following the operation result of each systolic array. The result of a spatial filter computed with a systolic array and a divider can thus be used as the input to a subsequent systolic array.

Claim 3 is for performing a correlation operation with a large reference area using a plurality of systolic arrays, claim 4 is for cascading spatial filters, claim 5 is for performing normalized correlation, and claim 6 is for performing feature extraction.

If conventional systolic arrays each fetch their own data, a memory must be provided for each data transfer bus and the same data written to every memory, which increases both the required memory capacity and the power consumption.

Taking into account that the image areas required by the individual systolic arrays overlap, the present invention transmits data for a plurality of pixels on the shared image input bus and lets each systolic array take the data it needs into its input buffer. This shortens the data waiting time, and hence the processing time, while a single memory keeps the required memory capacity low.

Furthermore, a means (the synchronization memory) is provided to synchronize the results that the individual systolic arrays produce by reading data from their input buffers at their own timings. This makes it possible to align operation results that emerge at different times and operate on them together, and to compute evaluation values and extract peaks by pipeline processing.

For example, compared with a 16×16 correlation operation on a single systolic array of 64 PEs, the same operation on four systolic arrays of 64 PEs each in the configuration of the present invention is 3.4 times faster.

In the invention of claim 2, a means is provided for cascading the systolic arrays by feeding the operation result of one systolic array into another, so that processing requiring a plurality of operations can be done with a single data input.

In the present invention, considering that the image areas required by the individual systolic arrays overlap, data for a plurality of pixels is transmitted on the shared image input bus and each systolic array takes the data it needs into its input buffer. This shortens the data waiting time, and hence the processing time, while a single memory keeps the required memory capacity low. In addition, a means (the synchronization memory) synchronizes the results that the individual systolic arrays produce by reading data from their input buffers at their own timings, so that results emerging at different times can be aligned and operated on together, and evaluation values are computed and peaks extracted by pipeline processing.

FIG. 1 shows a system configuration diagram of the present invention.

In FIG. 1, the image memory control circuit 1 stores images (for example, images taken by a camera mounted on a robot) in the image memory 2 and outputs a plurality of pixels of an image to the shared image input bus 3 in one cycle (see FIG. 2).

The image memory 2 temporarily holds images such as those taken by the camera.

The shared image input bus 3 is a bus that sends a plurality of pixels from the image memory 2 in one cycle so that each input buffer 41 can take in its designated image (see FIG. 2). In the embodiment described with FIG. 2, the shared image input bus 3 is configured to transfer, for example, four pixels of data in one cycle.

The parallel arithmetic circuit 4 includes a control circuit (not shown) that selects, from the shared image input bus 3, data of a specified horizontal/vertical size starting from a specified horizontal/vertical start point, and performs parallel operations (such as correlation operations) on the image taken in via the shared image input bus 3. Here it consists of the illustrated input buffer 41, systolic array 42, divider 43, and so on.

The input buffer 41 selects the image addressed to it from the plurality of pixels sent from the image memory 2 in one cycle over the shared image input bus 3, takes it in, and holds it temporarily (see FIG. 2).

The systolic array 42 is a parallel arithmetic circuit in which many PEs (processor elements) are arranged (see, for example, FIG. 15(b) described above) and performs high-speed operations.

The divider 43 divides the result computed by the systolic array 42.

The selector 5 selects the specified data from either the image (data) on the shared image input bus 3 or the image (data) obtained by dividing the result of the systolic array 42 with the divider 43, and stores it in the input buffer 41.

The synchronization memory 6 is a memory that synchronizes the result computed by a systolic array 42 (or the result after further division by the divider 43) so that it can be combined with the results computed by the other systolic arrays 42.

The pipeline arithmetic circuit 7 performs high-speed pipelined operations on the data after the synchronization memory 6 has synchronized the operation results (or the divided results) of the systolic arrays 42. Here it adds the synchronized operation results of the systolic arrays 42 together and extracts the peak, both by pipeline processing.

The operation result memory 8 stores the results computed by the pipeline arithmetic circuit 7.
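The interplay of the synchronization memory 6 and the pipeline arithmetic circuit 7 can be sketched in software as one FIFO per systolic array. This is only an illustrative model: the function name `synchronize_and_reduce`, the representation of results as `(cycle, value)` pairs, and the choice of the minimum sum as the peak (natural for SAD) are assumptions, not details taken from the patent.

```python
from collections import deque


def synchronize_and_reduce(streams):
    """Align per-array result streams, add them, and track the peak.

    streams: one list per array of (cycle, value) pairs in output order.
    The k-th value of every stream belongs to the same search position,
    but the arrays emit it at different cycles.
    """
    fifos = [deque() for _ in streams]     # one buffer per array
    events = sorted(
        (cycle, index, value)
        for index, stream in enumerate(streams)
        for cycle, value in stream
    )
    totals = []
    best = None                            # (position, smallest total so far)
    for _cycle, index, value in events:
        fifos[index].append(value)
        # When every FIFO holds a value, the heads refer to one position.
        while all(fifos):
            total = sum(fifo.popleft() for fifo in fifos)
            position = len(totals)
            totals.append(total)
            if best is None or total < best[1]:
                best = (position, total)
    return totals, best
```

The buffers absorb the timing differences, so the adder and the peak tracker always see values for the same search position, regardless of which array finished first.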

FIG. 2 is an explanatory diagram of the present invention (for FIG. 1). It shows the operation of each systolic array 42. The four illustrated systolic arrays 42, A, B, C, and D (corresponding to systolic arrays A, B, C, and D in FIG. 1), are each assigned one 8×8 partial reference image area obtained by dividing the 16×16 reference image into four (see FIG. 2(a)). From the 16×16 image area flowing on the shared image input bus 3, each array first takes its own 8×8 partial reference image area (the corresponding one of A, B, C, and D in FIG. 2(a)) into its input buffer 41. When the search image is loaded, each array takes the search image area needed for its partial area (A in FIG. 2(b-1), B in (b-2), C in (b-3), or D in (b-4)) into its input buffer 41, and each systolic array 42 executes its correlation operation. Because the output timing differs from one systolic array 42 to another, the outputs are synchronized in the synchronization memory 6, after which the pipeline arithmetic circuit 7 adds all the data and writes the result to the operation result memory 8.
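The decomposition described here rests on the fact that a SAD over a 16×16 reference equals the sum of the SADs of its four 8×8 parts, each matched against a correspondingly shifted window. A minimal sketch that checks this identity follows; the function names are hypothetical, and the test images used with it are arbitrary.

```python
def sad_at(search, ref, v, u):
    """SAD between ref and the window of search whose top-left corner is (v, u)."""
    return sum(
        abs(search[v + y][u + x] - ref[y][x])
        for y in range(len(ref))
        for x in range(len(ref[0]))
    )


def split_tiles(ref, tile):
    """Split a square reference image into tile x tile blocks keyed by offset."""
    side = len(ref)
    return {
        (dy, dx): [row[dx:dx + tile] for row in ref[dy:dy + tile]]
        for dy in range(0, side, tile)
        for dx in range(0, side, tile)
    }


def tiled_sad(search, ref, tile, v, u):
    """Add the partial SADs, each tile matched at its own offset."""
    return sum(
        sad_at(search, part, v + dy, u + dx)
        for (dy, dx), part in split_tiles(ref, tile).items()
    )
```

With a 16×16 search range, the tile at offset (8, 8) reads rows and columns 8 to 30 of a 31×31 search image, which is a 23×23 window, the same size as the per-array search areas in the text.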

In this embodiment, as the load/operation flow of FIG. 2(c) shows, loading the reference image takes 64 cycles and loading the search image takes 248 cycles. The operation result of systolic array D, which has the latest timing, is output 595 cycles after the reference image load, and the complete operation result is output 14 cycles after that. It takes 709 cycles from the start of the operation to the output of the complete result, a performance improvement of about 3.4 times. In the conventional configuration of FIGS. 14 and 15, quadrupling the array from 8×8 to 16×16 improves performance only by a factor of about two, whereas the configuration of FIGS. 1 and 2 of this embodiment achieves a substantial improvement of about 3.4 times, close to four times.
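The two improvement factors compared in this paragraph follow directly from the cycle totals quoted in the text. A short sketch of the arithmetic is given below; the constant and function names are hypothetical, and the totals are taken from the text as stated.

```python
def speedup(baseline_cycles, cycles):
    """How many times faster a run is than the baseline."""
    return baseline_cycles / cycles


BASELINE = 2412        # 8x8 array, 16x16 reference handled in four passes
SINGLE_16X16 = 1227    # one 16x16 array with a one-pixel-per-cycle port
SHARED_BUS = 709       # four 8x8 arrays on the shared bus (total stated in the text)

print(round(speedup(BASELINE, SINGLE_16X16), 1))   # 2.0
print(round(speedup(BASELINE, SHARED_BUS), 1))     # 3.4
```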

FIG. 2(a) shows the 16×16 reference image divided into four 8×8 pixel areas A, B, C, and D. As illustrated, the load time is

・34 (array D reference image load wait time) + 64 (array D reference image load time) + 66 (array D search image load wait time) + 529 (array D search image load time) + 14 (processing delay of the systolic array 42) = 709 cycles

(for each number, see the corresponding item and value in the load/operation flow of FIG. 2(c)).

Here, 529 is the product 23 × 23, the size of each of the areas A, B, C, and D in FIG. 2(b).

FIG. 2(b) shows examples for the systolic arrays 42. FIG. 2(b-1) shows an example of the search image of systolic array A, FIG. 2(b-2) that of systolic array B, FIG. 2(b-3) that of systolic array C, and FIG. 2(b-4) that of systolic array D. Here, each of the search areas A, B, C, and D is 23×23 pixels.

FIG. 2(c) shows the flow (waveforms) when the reference images A, B, C, and D of FIG. 2(a) and the search images A, B, C, and D of FIG. 2(b) are loaded from the image memory 2 into the respective input buffers 41 via the shared image input bus 3. The waveform (flow) of systolic array C is omitted from FIG. 2(c); it is similar to those of B and D.

FIG. 3 shows a system configuration diagram (part 2) of the present invention. It shows an example system configuration for applying spatial filters in multiple stages or for generating a difference image between images. Specifically, a selector 5 selects the image input data for each systolic array 42 from the data on the shared image input bus 3 and the operation result data of a systolic array 42, and a divider 43 follows the operation result of each systolic array 42. The divider 43 divides the operation result output by the systolic array 42 to produce the spatial filter output, which is sent to the synchronization memory 6 and to the next-stage systolic array 42. The next-stage systolic array 42 likewise performs a spatial filter operation and sends its result on to the following systolic array 42. Cascading the arrays in this way makes it possible to apply spatial filters in multiple stages.

In addition, by synchronizing the operation results output from the systolic arrays 42 in the synchronization memory 6 and taking their difference, a difference image between spatially filtered images can be generated.
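The cascade of spatial filters and the difference image can be sketched as follows. This is an illustrative model only: each stage is taken to be a weighted window sum (the systolic array) followed by an integer division (the divider), and the function names, the kernel representation, and the use of floor division are assumptions.

```python
def spatial_filter(image, kernel, divisor):
    """Weighted window sum (array stage) followed by integer division (divider stage)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(kernel[dy][dx] * image[y + dy][x + dx]
                for dy in range(k) for dx in range(k)) // divisor
            for x in range(cols)
        ]
        for y in range(rows)
    ]


def cascade(image, stages):
    """Feed each stage's output into the next stage, as in a cascade connection."""
    for kernel, divisor in stages:
        image = spatial_filter(image, kernel, divisor)
    return image


def difference_image(a, b):
    """Pixel-wise difference of two equally sized, time-aligned results."""
    return [[pa - pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]
```

For example, `cascade(image, [(box, 9), (box, 9)])` with a 3×3 kernel of ones applies a 3×3 averaging filter twice, with the second stage consuming the first stage's output and no second read of the source image.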

FIG. 3(a) shows the system configuration, and FIG. 3(b) shows the load/operation flow corresponding to FIG. 2(c) described above. When the systolic arrays 42 are cascaded as in FIG. 3(a), the reference image load, the search image load, and the loads into systolic arrays A and B (C and D are omitted) follow the cycles shown in FIG. 3(b).

FIG. 4 shows a system configuration diagram (part 3) of the present invention. In this configuration, [equation not reproduced] is first calculated when the reference image is loaded. Then, when the reference image is loaded, the three systolic arrays 42 each calculate [equation not reproduced], and the pipeline arithmetic circuit 7 calculates [equation not reproduced], thereby executing the normalized correlation operation on the image.

These calculations are annotated on the parallel arithmetic circuit 4 in the system configuration of FIG. 4(a). FIG. 4(b) shows the operation result, which is the same as Formula 9.
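As a rough illustration of how a normalized correlation can be assembled from sums computed in parallel, the sketch below uses one common formulation (the zero-mean normalized cross-correlation expressed through sums of f, f², and f·g). This formulation, the split of the sums between reference load time and search time, and all function names are assumptions; they are not taken from the equations of FIG. 4 or Formula 9.

```python
import math


def reference_sums(ref):
    """Sums over the reference image g, computable while it is loaded."""
    pixels = [p for row in ref for p in row]
    return sum(pixels), sum(p * p for p in pixels)


def window_sums(search, ref, v, u):
    """Sums over the search window f at (v, u): sum f, sum f^2, sum f*g."""
    s_f = s_ff = s_fg = 0
    for y, row in enumerate(ref):
        for x, g in enumerate(row):
            f = search[v + y][u + x]
            s_f += f
            s_ff += f * f
            s_fg += f * g
    return s_f, s_ff, s_fg


def normalized_correlation(search, ref, v, u):
    """Combine the partial sums into a correlation coefficient in [-1, 1]."""
    n = len(ref) * len(ref[0])
    s_g, s_gg = reference_sums(ref)
    s_f, s_ff, s_fg = window_sums(search, ref, v, u)
    denominator = math.sqrt((n * s_ff - s_f ** 2) * (n * s_gg - s_g ** 2))
    if denominator == 0:
        return 0.0   # flat window or flat reference: correlation undefined
    return (n * s_fg - s_f * s_g) / denominator
```

In this assumed split, the three window sums are independent of one another, so they could be accumulated by three arrays at once, with only the final combination left to a later pipeline stage.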

FIG. 5 shows a flowchart for setting the reference image in the present invention. It is a detailed flowchart of setting (loading) the reference image from the image memory 2 into each input buffer 41 in, for example, FIG. 1.

In FIG. 5, in S1 the transfer start coordinates (horizontal/vertical) and size (horizontal/vertical) are specified to the image memory control circuit 1. For example, a host (not shown) specifies to the image memory control circuit 1 of FIG. 1 (or likewise FIGS. 2 to 4; the same applies below) the horizontal and vertical transfer start coordinates of the reference image (see FIG. 2(a)) and its horizontal and vertical size. The shared image input bus 3 of FIG. 1 is assumed here to transfer four pixels of data in one cycle.

In S2, the capture start coordinates (horizontal/vertical) and size (horizontal/vertical) are specified to each of the parallel arithmetic circuits A to D. That is, for the input buffer 41 of each of the parallel arithmetic circuits A, B, C, and D in FIG. 1, the horizontal and vertical start coordinates and the horizontal and vertical size of the reference image to be captured from the shared image input bus 3 are specified.

In S3, the reference image data is sent from the image memory 2 to the shared image input bus 3, here four pixels per cycle.

In S4, only the data within the range of the specified size from the capture start coordinates is taken into the input buffer 41. That is, from the reference image data sent on the shared image input bus 3 in S3, each of the parallel arithmetic circuits A, B, C, and D takes only the reference image data specified for it in S2 and stores it in its input buffer 41. As a result, the input buffers 41 of A, B, C, and D hold only the reference images of the parts A, B, C, and D of, for example, FIG. 2(a).

In this way, the reference image data is sent from the image memory 2 to the shared image input bus 3 four pixels per cycle, each of the parallel arithmetic circuits A, B, C, and D takes in only the reference image specified for it and stores it in its input buffer 41, and the reference images A, B, C, and D (8×8) of FIG. 2(a) are thereby set in the respective input buffers 41.
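Steps S1 to S4 can be modelled in software as a bus that delivers four pixels per cycle and an input buffer that keeps only the pixels inside its own window. This is a minimal sketch: the function names `bus_words` and `capture` are hypothetical, and the placement of area D at the lower right (start coordinates 8, 8) is an assumption about the layout of FIG. 2(a).

```python
def bus_words(image, width=4):
    """Yield (row, first_col, pixels) for each bus cycle, row by row."""
    for r, line in enumerate(image):
        for c in range(0, len(line), width):
            yield r, c, line[c:c + width]


def capture(image, start_v, start_h, size_v, size_h, width=4):
    """Model one input buffer: keep only the pixels inside its window.

    Returns (buffer_rows, wait_cycles), where wait_cycles counts the bus
    cycles that pass before the first pixel of the window appears.
    """
    buffer = [[] for _ in range(size_v)]
    wait = None
    for cycle, (r, c, pixels) in enumerate(bus_words(image, width)):
        if not (start_v <= r < start_v + size_v):
            continue                      # row outside the window
        for offset, p in enumerate(pixels):
            if start_h <= c + offset < start_h + size_h:
                if wait is None:
                    wait = cycle          # cycles elapsed before first capture
                buffer[r - start_v].append(p)
    return buffer, wait
```

Under these assumptions a 16×16 reference image occupies 64 bus cycles, and the buffer for area D waits 34 cycles before its first pixel arrives, which agrees with the reference image load time and the array D wait time quoted for FIG. 2(c).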

FIG. 6 shows a flowchart for setting the search image in the present invention. It is a detailed flowchart of setting (loading) the search image from the image memory 2 into each input buffer 41 in, for example, FIG. 1.

In FIG. 6, in S11 the transfer start coordinates (horizontal/vertical) and size (horizontal/vertical) are specified to the image memory control circuit 1. For example, a host (not shown) specifies to the image memory control circuit 1 of FIG. 1 (or likewise FIGS. 2 to 4; the same applies below) the horizontal and vertical transfer start coordinates of the search image (see FIG. 2(b)) and its horizontal and vertical size. The shared image input bus 3 of FIG. 1 is assumed here to transfer four pixels of data in one cycle.

In S12, the capture start coordinates (horizontal/vertical) and size (horizontal/vertical) are specified to each of the parallel arithmetic circuits A to D. That is, for the input buffer 41 of each of the parallel arithmetic circuits A, B, C, and D in FIG. 1, the horizontal and vertical start coordinates and the horizontal and vertical size of the search image to be captured from the shared image input bus 3 are specified.

In S13, the search image data is sent from the image memory 2 to the shared image input bus 3, here four pixels per cycle.

In S14, only the data within the range of the specified size from the capture start coordinates is taken into the input buffer 41. That is, from the search image data sent on the shared image input bus 3 in S13, each of the parallel arithmetic circuits A, B, C, and D takes only the search image data specified for it in S12 and stores it in its input buffer 41. As a result, the input buffers 41 of A, B, C, and D hold only the search images of the parts A, B, C, and D of, for example, FIG. 2(b).

In this way, the search image data is sent from the image memory 2 to the shared image input bus 3 four pixels per cycle, each of the parallel arithmetic circuits A, B, C, and D takes in only the search image specified for it and stores it in its input buffer 41, and the search images A, B, C, and D (23×23) of FIG. 2(b) are thereby set in the respective input buffers 41.

FIG. 7 is an explanatory diagram of search image transfer according to the present invention (for the case of 31 × 31 pixels). It shows how the search image is transferred onto the shared image input bus 3 in the search image setting flowchart described above.

In FIG. 7, vertical coordinate = V, V + 1, ... denotes the vertical coordinates of the rows of the search image transferred from the image memory, here 31 rows. The horizontal direction corresponds to the number of pixels of the search image in the horizontal direction, here 31 pixels.

The search image represented in this way (which contains the 23 × 23 pixel search images A, B, C, and D in (b-1), (b-2), (b-3), and (b-4) of FIG. 2) can be sent sequentially from the image memory 2 onto the shared image input bus 3 four pixels per cycle: cycles 1 to 8 run horizontally along vertical coordinate V in FIG. 7 (31 pixels at four pixels per cycle take 8 cycles per row), cycles 9, 10, ... continue along vertical coordinate V + 1, and so on. From the search image of FIG. 7 sent onto the shared image input bus 3, each of the parallel arithmetic circuits A, B, C, and D in FIG. 1 selectively captures only the image addressed to it and stores it in its input buffer 41. The data is thus stored in the respective input buffers 41 as the search image loads of systolic arrays A, B, C, and D shown in FIG. 2(c), described earlier. In this example, a 31 × 31 region is sent from the image memory and each parallel arithmetic circuit captures a 23 × 23 region.
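The transfer and selective capture described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the circuit itself: the names `bus_words`, `InputBuffer`, and `transfer` are hypothetical, and the window offsets for A to D (0 or 8 pixels) are an assumption chosen so that four 23 × 23 windows fit inside the 31 × 31 region. The text fixes only the bus width (four pixels per cycle) and the region sizes.

```python
PIXELS_PER_CYCLE = 4  # bus width stated in the text

def bus_words(image, h0, v0, width, height):
    """Yield (cycle, v, h, pixels) for a width x height region sent row by row."""
    cycle = 0
    for v in range(v0, v0 + height):
        for h in range(h0, h0 + width, PIXELS_PER_CYCLE):
            cycle += 1
            end = min(h + PIXELS_PER_CYCLE, h0 + width)  # last word of a row may be short
            yield cycle, v, h, image[v][h:end]

class InputBuffer:
    """Keeps only the pixels inside its own window (start coordinates plus size)."""
    def __init__(self, h0, v0, width, height):
        self.h0, self.v0, self.width, self.height = h0, v0, width, height
        self.rows = [[None] * width for _ in range(height)]

    def snoop(self, v, h, pixels):
        if not (self.v0 <= v < self.v0 + self.height):
            return  # row lies outside this window
        for i, p in enumerate(pixels):
            col = h + i - self.h0
            if 0 <= col < self.width:
                self.rows[v - self.v0][col] = p

def transfer(image, region, windows):
    """Broadcast `region` once; every buffer captures its own window from the bus."""
    buffers = {name: InputBuffer(*w) for name, w in windows.items()}
    cycles = 0
    for cycles, v, h, pixels in bus_words(image, *region):
        for buf in buffers.values():
            buf.snoop(v, h, pixels)
    return cycles, buffers

# Test image whose pixel value encodes its coordinates.
image = [[v * 100 + h for h in range(40)] for v in range(40)]
region = (0, 0, 31, 31)  # (h0, v0, width, height) sent by the image memory
# Assumed (h0, v0, width, height) windows for circuits A to D.
windows = {"A": (0, 0, 23, 23), "B": (8, 0, 23, 23),
           "C": (0, 8, 23, 23), "D": (8, 8, 23, 23)}
cycles, buffers = transfer(image, region, windows)
```

In this model each 31-pixel row occupies 8 bus cycles, so the whole region is broadcast once in 31 × 8 = 248 cycles, and all four buffers are filled from that single pass even though their windows overlap.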

FIG. 8 is an explanatory diagram of the present invention. It schematically shows how a correlation function (for an image size of 16 × 16 and a search range of 16 × 16) is calculated using the system configuration of FIG. 1. In the figure:
・S is the search image
・R is the reference image
・Parallel arithmetic circuits A, B, C, and D are the parallel arithmetic circuits A, B, C, and D of FIG. 1

Here, after the search image and the reference image described above have been stored from the image memory 2 into the input buffers 41 of the parallel arithmetic circuits A, B, C, and D (see FIGS. 1, 2, and 5 to 7), the systolic arrays A, B, C, and D each execute the correlation operation expressed by the equation in FIG. 8(a) and calculate the result shown in FIG. 8(b).
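The way four arrays can share one correlation can be illustrated with a short Python sketch. The equation of FIG. 8(a) is not reproduced in this text, so the sketch assumes a plain sum-of-products correlation. It also assumes that the 16 × 16 reference image is split into four 8 × 8 quadrants, one per circuit, which is consistent with the 23 × 23 windows mentioned above (8 + 16 - 1 = 23). The function names `correlate` and `crop` are hypothetical.

```python
import random

def correlate(search, ref, n_shifts):
    """Sum-of-products correlation of `ref` at every shift in an n_shifts x n_shifts range."""
    rh, rw = len(ref), len(ref[0])
    return [[sum(search[dy + y][dx + x] * ref[y][x]
                 for y in range(rh) for x in range(rw))
             for dx in range(n_shifts)]
            for dy in range(n_shifts)]

def crop(img, h0, v0, w, h):
    """Return the w x h sub-image whose top-left corner is (h0, v0)."""
    return [row[h0:h0 + w] for row in img[v0:v0 + h]]

random.seed(0)
N, HALF = 16, 8
search = [[random.randrange(256) for _ in range(2 * N - 1)]
          for _ in range(2 * N - 1)]                               # 31 x 31
ref = [[random.randrange(256) for _ in range(N)] for _ in range(N)]  # 16 x 16

# Assumed split: circuit -> (horizontal, vertical) offset of its 8 x 8 reference quadrant.
offsets = {"A": (0, 0), "B": (HALF, 0), "C": (0, HALF), "D": (HALF, HALF)}
partial = {}
for name, (h0, v0) in offsets.items():
    window = crop(search, h0, v0, HALF + N - 1, HALF + N - 1)  # 23 x 23 search window
    partial[name] = correlate(window, crop(ref, h0, v0, HALF, HALF), N)

# Adding the four partial maps reproduces the full 16 x 16 correlation map.
combined = [[sum(partial[k][dy][dx] for k in partial) for dx in range(N)]
            for dy in range(N)]
full = correlate(search, ref, N)

# Peak extraction: the shift (dy, dx) with the largest combined value.
peak = max(((dy, dx) for dy in range(N) for dx in range(N)),
           key=lambda p: combined[p[0]][p[1]])
```

Under these assumptions `combined` equals `full`, which is why each circuit needs only its own 23 × 23 portion of the 31 × 31 search region.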

FIG. 9 is an explanatory diagram of the present invention. It schematically shows how a filter operation (for a filter size of 16 × 16 and a processing range of 16 × 16) is executed using the system configuration of FIG. 1. In the figure:
・K is the filter
・I is the image to be processed
・Parallel arithmetic circuits A, B, C, and D are the parallel arithmetic circuits A, B, C, and D of FIG. 1

Here, after the filter (16 × 16 pixels) and the processing range (16 × 16 pixels) described above have been stored from the image memory 2 into the input buffers 41 of the parallel arithmetic circuits A, B, C, and D (see FIGS. 1, 2, and 5 to 7), the systolic arrays A, B, C, and D each execute the filter operation expressed by the equation in FIG. 9(a) and calculate the result shown in FIG. 9(b).

FIG. 10 shows the filter coefficient/divisor setting flowchart of the present invention. It is the flowchart for setting the filter coefficients in the input buffers 41 and the divisors C1, C2, C3, and C4 in the dividers 43 when performing the filter operation of FIG. 9 described above.

In FIG. 10, in S21, transfer start coordinates (horizontal/vertical) and a size (horizontal/vertical) are specified to the image memory control circuit 1. For example, a host (not shown) specifies to the image memory control circuit 1 of FIG. 1 the horizontal and vertical transfer start coordinates of the filter coefficients and their horizontal and vertical size. The shared image input bus 3 of FIG. 1 is assumed to be able to transfer four pixels of data per cycle.

In S22, capture start coordinates (horizontal/vertical) and a size (horizontal/vertical) are specified for each of the parallel arithmetic circuits A to D. Specifically, for the input buffer 41 of each of the parallel arithmetic circuits A, B, C, and D in FIG. 1, the horizontal and vertical start coordinates and the horizontal and vertical size of the filter coefficients to be captured from the shared image input bus 3 are specified.

In S23, the filter coefficients K1, K2, K3, and K4 are sent from the image memory 2 onto the shared image input bus 3, in this example four pixels per cycle.

In S24, only the data lying within the specified size from the capture start coordinates is captured into the input buffer 41. Specifically, from the filter coefficients K sent onto the shared image input bus 3 in S23, each of the parallel arithmetic circuits A, B, C, and D captures only the filter coefficients specified for it in S22 and stores them in its input buffer 41. As a result, the input buffers 41 of A, B, C, and D each capture and store only their own filter coefficients.

In S25, the divisor C is set in each parallel arithmetic circuit 4. Specifically, a host (not shown) sets the corresponding divisor C1, C2, C3, or C4 in the divider 43 of each parallel arithmetic circuit 4.

As described above, the filter coefficients are sent sequentially from the image memory 2 onto the shared image input bus 3 four pixels per cycle, each of the parallel arithmetic circuits A, B, C, and D captures only the filter coefficients specified for it and stores them in its input buffer 41, and the divisors C1, C2, C3, and C4 are stored in the respective dividers 43. This completes the preparation for the filter operation.
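Once coefficients and divisors are in place, the filter operation of one circuit can be sketched in Python as below. The equation of FIG. 9(a) is not reproduced in this text, so the sketch assumes a sum of products over the filter footprint followed by an integer division by the divisor C. The name `spatial_filter`, the `config` dictionary, and the choice of a 16 × 16 box filter with divisor 256 are illustrative assumptions only.

```python
def spatial_filter(image, kernel, divisor, out_h, out_w):
    """Sum of kernel * image over the kernel footprint, then divide by the divisor."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[y][x] * image[v + y][h + x]
                 for y in range(kh) for x in range(kw)) // divisor
             for h in range(out_w)]
            for v in range(out_h)]

# Hypothetical host-side setup (S21 to S25): one coefficient set K and one divisor C
# per parallel arithmetic circuit. The same 16 x 16 box filter with divisor 256 is
# used for every circuit purely for illustration.
box = [[1] * 16 for _ in range(16)]
config = {name: {"K": box, "C": 256} for name in "ABCD"}

# 31 x 31 input whose pixel value equals its horizontal coordinate.
ramp = [[h for h in range(31)] for _ in range(31)]

# 16 x 16 processing range filtered by circuit A.
out_a = spatial_filter(ramp, config["A"]["K"], config["A"]["C"], 16, 16)
```

With this box filter the divisor 256 turns the sum into a truncated local mean, which shows why a divisor is configured together with the coefficients.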

FIG. 11 is an explanatory diagram of the present invention. It schematically shows a cascade of filter operations under the system configuration of FIG. 3.

(1) For the input image I0, the filter operation of filter A is performed by the parallel arithmetic circuit A of FIG. 1, and the result (the output of the divider 43) is generated as image I1 (the operation is the one expressed by the equation in the lower part of the figure).

(2) The image I1 generated in (1) is then subjected to the filter operation of filter B by the parallel arithmetic circuit B of FIG. 1, and the result (the output of the divider 43) is generated as image I2 (the operation is the one expressed by the equation in the lower part of the figure).

(3) A subtraction between image I1 and image I2 is performed to generate image I3.
As described above, the system configuration of FIG. 3 makes it possible to cascade filter operations, that is, to apply filter A and then filter B in succession (the configuration of FIG. 3 allows up to four cascaded filter operations).
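Steps (1) to (3) can be sketched in Python as follows. This is an illustration under assumptions that the text does not state: each filter computes only fully covered output pixels, division is integer division, the subtraction is taken as I1 minus I2, and I1 is cropped to the pixels centred under filter B so that the two images line up. The names `filt` and `cascade` are hypothetical.

```python
def filt(img, kernel, divisor):
    """Filter `img` with `kernel`, keeping only fully covered output pixels."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[y][x] * img[v + y][h + x]
                 for y in range(kh) for x in range(kw)) // divisor
             for h in range(len(img[0]) - kw + 1)]
            for v in range(len(img) - kh + 1)]

def cascade(i0, filter_a, filter_b):
    """Return (I1, I2, I3) for two cascaded filters followed by a subtraction."""
    (ka, ca), (kb, cb) = filter_a, filter_b
    i1 = filt(i0, ka, ca)          # (1) filter A on the input image
    i2 = filt(i1, kb, cb)          # (2) filter B on the output of filter A
    oy, ox = len(kb) // 2, len(kb[0]) // 2
    # (3) assumed alignment: compare I2 with the I1 pixel under the centre of filter B
    i3 = [[i1[v + oy][h + ox] - i2[v][h] for h in range(len(i2[0]))]
          for v in range(len(i2))]
    return i1, i2, i3

box3 = [[1] * 3 for _ in range(3)]   # small 3 x 3 box filter for illustration
i0 = [[0] * 7 for _ in range(7)]
i0[3][3] = 81                        # single bright pixel
i1, i2, i3 = cascade(i0, (box3, 9), (box3, 9))
```

Subtracting a more strongly smoothed image from a less strongly smoothed one in this way gives a band-pass style response; a uniform input yields an all-zero I3.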

FIG. 12 shows the normalized correlation reference image setting flowchart of the present invention.
In FIG. 12, in S31, transfer start coordinates (horizontal/vertical) and a size (horizontal/vertical) are specified to the image memory control circuit 1. For example, a host (not shown) specifies to the image memory control circuit 1 of FIG. 1 the horizontal and vertical transfer start coordinates of the reference image (see FIG. 2(b)) and its horizontal and vertical size. The shared image input bus 3 of FIG. 1 is assumed to be able to transfer four pixels of data per cycle.

In S32, capture start coordinates (horizontal/vertical) and a size (horizontal/vertical) are specified for the parallel arithmetic circuit C. Specifically, for the input buffer 41 of the parallel arithmetic circuit C in FIG. 1, the horizontal and vertical start coordinates and the horizontal and vertical size of the reference image to be captured from the shared image input bus 3 are specified.

In S33, the reference image data is sent from the image memory 2 onto the shared image input bus 3, in this example four pixels per cycle.

In S34, only the data lying within the specified size from the capture start coordinates is captured into the input buffer 41. Specifically, from the reference image data sent onto the shared image input bus 3 in S33, the parallel arithmetic circuit C captures only the reference image data specified in S32 and stores it in its input buffer 41.

In S35, the operation shown in the figure is performed to generate the normalized correlation reference image.
As described above, the reference image data is sent sequentially from the image memory 2 onto the shared image input bus 3 four pixels per cycle, and the parallel arithmetic circuit C captures only the reference image specified for it, stores it in its input buffer 41, and can then execute the operation of S35.

FIG. 13 shows the search image setting flowchart for the normalized correlation operation of the present invention.
In FIG. 13, in S41, transfer start coordinates (horizontal/vertical) and a size (horizontal/vertical) are specified to the image memory control circuit 1. For example, a host (not shown) specifies to the image memory control circuit 1 of FIG. 4 the horizontal and vertical transfer start coordinates of the search image (see FIG. 2(b)) and its horizontal and vertical size. The shared image input bus 3 of FIG. 4 is assumed to be able to transfer four pixels of data per cycle.

In S42, capture start coordinates (horizontal/vertical) and a size (horizontal/vertical) are specified for each of the parallel arithmetic circuits A to C. Specifically, for the input buffer 41 of each of the parallel arithmetic circuits A, B, and C in FIG. 4, the horizontal and vertical start coordinates and the horizontal and vertical size of the search image to be captured from the shared image input bus 3 are specified.

Note that in the normalized correlation operation, the same start coordinates and the same size are specified for the input buffers of the parallel arithmetic circuits A, B, and C.

In S43, the search image data is sent from the image memory 2 onto the shared image input bus 3, in this example four pixels per cycle.

In S44, only the data lying within the specified size from the capture start coordinates is captured into the input buffer 41. Specifically, from the search image data sent onto the shared image input bus 3 in S43, each of the parallel arithmetic circuits A, B, and C captures only the search image data specified for it in S42 and stores that data in its input buffer 41.

In S45, each parallel arithmetic circuit 4 performs the operation shown in the figure.
In S46, the equation shown in the figure is evaluated and the coordinates at which its absolute value is largest are obtained.

As described above, the search image data is sent sequentially from the image memory 2 onto the shared image input bus 3 four pixels per cycle, and each of the parallel arithmetic circuits A, B, and C captures only the search image specified for it and stores it in its input buffer 41. After the operations of S45 have been executed, the operation of S46 is performed, and the coordinates at which the absolute value of its result is largest can be obtained.
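The equations of S45 and S46 appear only in the figures, so the following Python sketch is an assumption rather than the patented formula. It uses the standard normalized cross-correlation coefficient, decomposed into three sums taken over the same search window, which is consistent with the note that circuits A, B, and C receive identical start coordinates and sizes. Which array computes which sum, and the names `window_sums` and `ncc_peak`, are hypothetical.

```python
import math

def window_sums(search, ref, dy, dx):
    """The three sums assumed to come from arrays A, B, and C for one shift."""
    sr = s = ss = 0
    for y, ref_row in enumerate(ref):
        for x, r in enumerate(ref_row):
            p = search[dy + y][dx + x]
            sr += p * r   # assumed array A: sum of S * R
            s += p        # assumed array B: sum of S
            ss += p * p   # assumed array C: sum of S^2
    return sr, s, ss

def ncc_peak(search, ref):
    """Return (score, (dy, dx)) where the absolute normalized correlation is largest."""
    n = len(ref) * len(ref[0])
    sum_r = sum(map(sum, ref))
    sum_rr = sum(r * r for row in ref for r in row)
    var_r = n * sum_rr - sum_r ** 2   # reference-side term, fixed for all shifts
    best = (0.0, None)
    for dy in range(len(search) - len(ref) + 1):
        for dx in range(len(search[0]) - len(ref[0]) + 1):
            sr, s, ss = window_sums(search, ref, dy, dx)
            var_s = n * ss - s ** 2
            if var_s == 0 or var_r == 0:
                continue  # flat window or flat reference: coefficient undefined
            score = (n * sr - s * sum_r) / math.sqrt(var_s * var_r)
            if abs(score) > abs(best[0]):
                best = (score, (dy, dx))
    return best

ref = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
search = [[0] * 8 for _ in range(8)]
for y in range(3):
    for x in range(3):
        search[2 + y][3 + x] = 2 * ref[y][x] + 5   # reference with gain 2 and offset 5
score, position = ncc_peak(search, ref)
```

Because the coefficient is normalized, the copy of the reference with a different gain and offset still produces the peak, which is the property that distinguishes this operation from the plain correlation of FIG. 8.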

The present invention relates to an image processing apparatus that takes into account the overlap between the image regions required by the individual systolic arrays. Data for a plurality of pixels is sent on a shared image input bus and each systolic array captures the data it needs into its input buffer, which shortens processing time by reducing data wait time while a single memory keeps the required memory capacity low. The apparatus also provides means (a synchronization memory) for synchronizing the results that the individual systolic arrays produce after reading data from their input buffers at their own timings, so that operation results emerging at different timings can be aligned and combined. It further performs evaluation value calculation and peak extraction by pipeline processing.
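The role of the synchronization memory can be illustrated with a small Python sketch. It is a behavioural model under assumptions, not the hardware design: each array's results are modelled as a first-in first-out queue, the latencies in `latency` are made-up values, and the pipeline stage simply adds the aligned results. The class name `SyncMemory` is hypothetical.

```python
from collections import deque

class SyncMemory:
    """One FIFO per systolic array; releases results only when all are present."""
    def __init__(self, names):
        self.fifos = {n: deque() for n in names}

    def push(self, name, value):
        self.fifos[name].append(value)

    def pop_aligned(self):
        """Return one result per array once every FIFO holds data, else None."""
        if all(self.fifos.values()):   # an empty deque is falsy
            return {n: f.popleft() for n, f in self.fifos.items()}
        return None

latency = {"A": 0, "B": 2, "C": 5, "D": 3}   # hypothetical skew in cycles
# Four results per array; result i of array number k has the value 10 * i + k.
results = {n: [10 * i + k for i in range(4)] for k, n in enumerate(latency)}

sync = SyncMemory(latency)
sums = []
for cycle in range(4 + max(latency.values())):
    for n, lat in latency.items():
        i = cycle - lat
        if 0 <= i < 4:
            sync.push(n, results[n][i])      # array n emits result i at this cycle
    aligned = sync.pop_aligned()
    if aligned is not None:
        sums.append(sum(aligned.values()))   # pipeline stage: add aligned results
```

In this model nothing leaves the synchronization memory until the slowest array has produced its first result, after which one aligned set is released per cycle.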

FIG. 1 is a system configuration diagram of the present invention.
FIG. 2 is an explanatory diagram of the present invention (FIG. 1).
FIG. 3 is a system configuration diagram of the present invention (part 2).
FIG. 4 is a system configuration diagram of the present invention (part 3).
FIG. 5 is a reference image setting flowchart of the present invention.
FIG. 6 is a search image setting flowchart of the present invention.
FIG. 7 is an explanatory diagram of search image transfer according to the present invention (for the case of 31 × 31).
FIG. 8 is an explanatory diagram of the present invention.
FIG. 9 is an explanatory diagram of the present invention.
FIG. 10 is a filter coefficient/divisor setting flowchart of the present invention.
FIG. 11 is an explanatory diagram of the present invention.
FIG. 12 is a normalized correlation reference image setting flowchart of the present invention.
FIG. 13 is a search image setting flowchart for the normalized correlation operation of the present invention.
FIG. 14 is an explanatory diagram of the prior art (part 1).
FIG. 15 is an explanatory diagram of the prior art (part 2).

Explanation of symbols

1: Image memory control circuit
2: Image memory
3: Shared image input bus
4: Parallel arithmetic circuit
41: Input buffer
42: Systolic array
43: Divider
5: Selector
6: Synchronization memory
7: Pipeline arithmetic circuit
8: Operation result memory

Claims

1. An image processing apparatus that performs correlation calculation of images, comprising:
a shared image input bus for transferring data of a plurality of pixels from a memory in one cycle;
means for selecting, from the data of the plurality of pixels transferred on the shared image input bus, data of a specified vertical and horizontal size starting from specified horizontal and vertical start points;
an input buffer for storing the selected data;
a systolic array that takes in data from the input buffer and performs operations on it;
a divider for dividing the operation result of the systolic array;
selection means for taking into the input buffer either data designated in advance from the shared image input bus or data divided by the divider of the preceding stage, and inputting it to the systolic array;
a synchronization memory for synchronizing the results calculated by the systolic arrays; and
means for pipeline-processing operations between the plurality of synchronized operation results or peak extraction.

2. The image processing apparatus according to claim 1, wherein partial areas obtained by dividing the reference image area are assigned to the respective systolic arrays, the correlation with the search area is calculated for each partial area, the individual calculation results output at different timings are synchronized in the synchronization memory, and then the addition of the calculation results and peak detection are pipeline-processed.

3. The image processing apparatus according to claim 1, wherein the systolic arrays are connected in cascade, a spatial filter operation is performed in each systolic array, the output results of the individual spatial filters are synchronized in the synchronization memory, and then the operation results are subtracted from one another.

4. The image processing apparatus according to claim 1, wherein, when the reference image is loaded,

and when the search image is loaded,

are each calculated by three systolic arrays, and a normalized correlation operation is processed by performing a pipeline operation on the following equation.