JPS6326912B2


Info

Publication number: JPS6326912B2
Application number: JP6232682A
Authority: JP
Inventors: Yoshiki Kobayashi; Tadashi Fukushima; Yoshuki Okuyama
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1982-04-16
Filing date: 1982-04-16
Publication date: 1988-06-01
Also published as: JPS58181171A

Description

[Detailed Description of the Invention]

The present invention relates to a parallel image processing processor that performs local neighborhood image processing such as spatial product-sum operations, and in particular to a parallel image processing processor whose architecture is suited to LSI implementation.

Many image processing processors try to raise speed by processing image data in parallel, as do those developed under the Ministry of International Trade and Industry's large-scale project "Pattern Information Processing System" (a collection of papers presenting its research and development results was published in October 1980). Because image data extends in two dimensions, processing all of it in parallel is difficult. However, many operations act on neighboring image data, as in the spatial product-sum operation used for noise removal and contour extraction, so many designs process a local region of, for example, m rows by n columns of the image in parallel. Such locally parallel image processing is surveyed in the above collection and in Masatsugu Kidode, "Trends in Image Processing Hardware," Information Processing Computer Vision Study Group Material 8-6 (September 1980), but apart from the CCD analog processing type none has been implemented as an LSI. Converting a processor with a conventional architecture directly into an LSI raises issues such as integration density and pin count.

The object of the present invention is to provide a parallel image processing processor whose architecture is suited to LSI implementation and also to high-speed processing.

The present invention is characterized in that a basic module is formed from: an image data input port; a plurality of shift registers that sequentially take in image data from the image data input port; a plurality of processor elements that receive the contents of the respective shift registers and perform image processing operations; a first arithmetic circuit that adds the operation results of the processor elements; an operation result data input port that receives operation result data from the basic module of the preceding stage; a second arithmetic circuit that adds the operation result data to the operation result of the first arithmetic circuit; an operation result data output port that outputs the operation result data of the second arithmetic circuit; and pipeline registers placed between the shift registers and the processor elements, between the processor elements and the first arithmetic circuit, and between the first arithmetic circuit and the second arithmetic circuit. It is further characterized in that a second pipeline register is inserted between the operation result data input port and the second arithmetic circuit in forming the basic module.

FIGS. 1 to 3 show a recently considered example that forms the premise of the present invention.

FIG. 1 shows the configuration of a typical image processing system. It has an industrial television camera 5 as the image input device, an image memory 3 as the image storage device, and a CRT monitor 4 that displays the contents of the image memory. The image information in the image memory 3 is processed by the image processing processor 2, and the result is either stored back in the image memory 3 or passed to the management processor 1 that controls the whole system.

A representative image processing function is the spatial product-sum operation. As shown in FIG. 2, local image data f₁₁ to f₄₄ of, for example, 4×4 pixels are multiplied by predetermined weights w₁₁ to w₄₄ and the products are summed.

This enables image processing such as noise removal and contour enhancement.
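
The operation described above can be sketched in a few lines of Python. This is only an illustration: the function name `spatial_product_sum`, the sample pixel values, and the choice of uniform averaging weights are assumptions made for the example and do not come from the patent.

```python
def spatial_product_sum(f, w):
    """Return the sum over i, j of f[i][j] * w[i][j] for two equally sized 2-D lists."""
    return sum(fv * wv
               for f_row, w_row in zip(f, w)
               for fv, wv in zip(f_row, w_row))


# Hypothetical 4x4 window: a flat region with one bright (noisy) pixel.
f = [[10, 10, 10, 10],
     [10, 10, 90, 10],
     [10, 10, 10, 10],
     [10, 10, 10, 10]]

# Uniform weights act as an averaging (noise-reducing) filter once scaled by 1/16.
w = [[1] * 4 for _ in range(4)]

g = spatial_product_sum(f, w)
smoothed = g / 16
```

With other weight patterns (for example, positive weights on one side and negative on the other) the same sum responds to edges instead of smoothing them.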

As an image processing processor that handles such local image data of, for example, 4×4 pixels, four image processing processor basic modules 10, each having four processor elements (PE #1 to #4) 12 as shown in FIG. 3, are combined to form the parallel image processing processor 2- (referred to as Type I). The image memory 3 supplies one column of the local image data in parallel (f₁₄ to f₄₄ in FIG. 3), and the operation result (g in FIG. 3) is stored in the image memory 3.

The basic module 10 has an image data input port 24 that takes in the image data of the row to be processed and an operation result data output port 35 that outputs the internal processing result. When image data f₁₄ is input, the neighboring pixels f₁₃, f₁₂, and f₁₁, each one pixel further back, are supplied through the shift register 11 so that PE #4 to PE #1 each receive their corresponding pixel. Pixel f₁₁ is output from the image data output port 25 so that the size of the spatial product-sum operation can be extended beyond 4×4. Each PE 12 receives the image data f to be processed from the shift register 11 and the weight data w from the weight storage memory 15, and performs the multiplication. The arithmetic circuit 13 adds the results of the four PEs 12 to form a partial sum. The arithmetic circuit 14 accumulates this with the partial sum arriving at the operation result input port 30, and the result is output from the operation result output port 35 to the basic module 10 of the next stage.

By stacking four stages of the basic module 10 in this way, the final basic module 10D outputs $g = \sum_{i,j=1,1}^{4,4} f_{i,j} \ast w_{i,j}$.
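
The following Python sketch models the Type I arrangement behaviorally: one module per image row, each holding a shift register, its PEs, a row adder (arithmetic circuit 13), and an accumulating adder (arithmetic circuit 14). The class and function names are hypothetical, timing is idealized to one result per clock, and the hardware detail of ports 24, 25, 30, and 35 is reduced to function arguments and return values.

```python
from collections import deque


class BasicModule:
    """Behavioral model of one basic module handling one image row."""

    def __init__(self, weights):
        self.weights = list(weights)  # w_r1 .. w_rn for this row
        # Shift register: the oldest pixel sits at index 0 and feeds PE #1.
        self.taps = deque([0] * len(self.weights), maxlen=len(self.weights))

    def step(self, pixel, partial_in):
        self.taps.append(pixel)                                       # shift in new pixel
        products = [f * w for f, w in zip(self.taps, self.weights)]   # the PEs
        row_sum = sum(products)                                       # arithmetic circuit 13
        return partial_in + row_sum                                   # arithmetic circuit 14


def type1_filter(image_rows, weights):
    """Feed one pixel column per clock; return one result per full window."""
    modules = [BasicModule(w_row) for w_row in weights]
    n = len(weights[0])
    outputs = []
    for t, column in enumerate(zip(*image_rows)):
        partial = 0  # the first module has no preceding stage
        for module, pixel in zip(modules, column):
            partial = module.step(pixel, partial)
        if t >= n - 1:  # shift registers now hold a complete window
            outputs.append(partial)
    return outputs
```

Called with four image rows and a 4×4 weight table, `type1_filter` returns the product-sum for every horizontal window position, which is the sequence of g values the chain of modules 10A to 10D would emit.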

The time chart for this is shown in FIG. 4. The operation described above is executed within the basic clock time Δt1 and the result g is output; in the next Δt1, the result g for the 4×4-pixel input window shifted by one pixel is output. The spatial product-sum results for all 4×4-pixel windows of the successively input image data are therefore output one after another.

FIG. 5 shows an embodiment of the parallel image processing processor according to the present invention, in which the basic clock time Δt1 of the Type I image processing processor 2- of the preceding example is shortened by pipeline processing. This is called the pipeline version of Type I, parallel image processing processor 2-P. In Type I, the basic clock time Δt1 had to be at least the sum of the processing times of all of the following: (1) input of image data f_{i,j} to the shift register 11; (2) multiplication of weight w_{i,j} by image data f_{i,j} in the processor element 12; (3) partial-sum processing by the arithmetic circuit 13; and (4) partial-sum accumulation by the arithmetic circuit 14. In contrast, by interposing pipeline registers 16 between (1) and (2), between (2) and (3), and between (3) and (4), as in the example of FIG. 5, the basic clock time Δt2 can be reduced to the largest of the processing times of (1) to (4) rather than their sum. The time chart is shown in FIG. 6. Process (1) is executed at time 1, (2) at time 2, (3) at time 3, and (4) at time 4. For the next input image, process (1) is executed at time 2, (2) at time 3, (3) at time 4, and (4) at time 5, so the components operate one after another in pipeline fashion and the processing speed is improved.
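
A small numeric sketch may make the clock-time argument concrete. The stage delays below are invented purely for illustration (the patent gives no figures), and register setup overhead is ignored; `pipeline_schedule` is a hypothetical helper that lists which input occupies which stage in each time slot.

```python
# Hypothetical stage delays in nanoseconds (illustrative values only).
stage_ns = {
    "shift-register input": 20,
    "PE multiplication": 60,
    "partial sum (circuit 13)": 35,
    "accumulation (circuit 14)": 40,
}

dt1 = sum(stage_ns.values())  # unpipelined: clock must cover all four stages
dt2 = max(stage_ns.values())  # pipelined: clock must cover only the slowest stage


def pipeline_schedule(num_inputs, num_stages=4):
    """Map each time slot to the (input number, stage number) pairs active in it."""
    schedule = {}
    for k in range(num_inputs):
        for s in range(num_stages):
            schedule.setdefault(k + s + 1, []).append((k + 1, s + 1))
    return schedule


schedule = pipeline_schedule(num_inputs=3)
```

Under these assumed delays the clock period drops from 155 ns to 60 ns, and `schedule[2]` shows input 1 in stage 2 while input 2 is already in stage 1, matching the overlap described for the time chart.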

FIG. 7 shows a second embodiment of the present invention, a configuration that can shorten the basic clock Δt2 of the parallel image processing processor 2-P still further. It is called the pipeline-skew version of Type I, parallel image processing processor 2-PS. In the P type of FIG. 5, the basic clock time Δt2 is likely to be constrained by the partial-sum accumulation time of the arithmetic circuits 14. This is because, when the basic modules 10 are arranged in n stages, Δt2 must cover n times the sum of the processing time in the arithmetic circuit 14 and the input/output time of the operation results at ports 30 and 35. In particular, when the basic module 10 is implemented as an LSI, the input/output delay cannot be ignored. For this reason, a pipeline register 16 is additionally placed in the partial-sum accumulation path of the Type IP of FIG. 5 so that the operations between the basic modules 10A to 10D are also pipelined, which reduces the above constraint on Δt2 to 1/n.

In this PS type of FIG. 7, as the time chart of FIG. 8 shows, the partial sums of the basic modules 10A to 10D are all computed at the same time 3, so their timing no longer matches in the accumulation stage. In the PS type of FIG. 7, a variable-stage skew correction shift register 17 for aligning this timing is placed immediately after the image data input port 24. Since each of the basic modules 10A to 10D has one pipeline stage in its accumulation path, the number of stages of the variable-stage skew correction shift register 17 is set as follows:

Basic module 10A: 0 stages
Basic module 10B: 1 stage
Basic module 10C: 2 stages
Basic module 10D: 3 stages

In this way the mismatch in the time chart of FIG. 8 (the dotted portions) is corrected, and continuous pipeline operation with period Δt3 becomes possible.
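
The effect of the skew correction can be checked with a behavioral simulation. The sketch below assumes one pipeline register per module on the accumulation path and delays the pixel stream of module r by r clocks; all function names are hypothetical, and setting `skew=False` shows the misalignment that the correction removes.

```python
from collections import deque


def delayed(stream, stages, fill=0):
    """Skew-correction shift register: delay a pixel stream by `stages` clocks."""
    return [fill] * stages + list(stream)


def row_sums(stream, weights):
    """Per-clock partial sum of one module (shift register, PEs, circuit 13)."""
    taps = deque([0] * len(weights), maxlen=len(weights))
    out = []
    for pixel in stream:
        taps.append(pixel)
        out.append(sum(f * w for f, w in zip(taps, weights)))
    return out


def pipelined_accumulate(image_rows, weights, skew=True):
    """Accumulate row sums through a chain with one register per module."""
    n_mod, n_taps, width = len(weights), len(weights[0]), len(image_rows[0])
    total_clocks = width + n_mod - 1
    sums = []
    for r, (row, w) in enumerate(zip(image_rows, weights)):
        stream = delayed(row, r if skew else 0)
        stream += [0] * (total_clocks - len(stream))  # pad so all streams are equally long
        sums.append(row_sums(stream, w))
    regs = [0] * n_mod  # accumulation-path pipeline registers
    final = []
    for t in range(total_clocks):
        # Each module adds its own row sum to what the previous module latched last clock.
        regs = [(regs[r - 1] if r else 0) + sums[r][t] for r in range(n_mod)]
        final.append(regs[-1])
    first_valid = (n_taps - 1) + (n_mod - 1)  # window fill time plus chain latency
    return final[first_valid:]
```

With `skew=True` the outputs equal the direct 4×4 product-sums, delayed by the chain latency; with `skew=False` each module contributes a row sum from a different window position, so the totals are generally wrong.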

As is easily seen, the timing mismatch is resolved in the same way if the skew register 17 is placed immediately after the arithmetic circuit 13 that computes the partial sum, or immediately before or after each PE 12.

FIG. 9 shows a configuration of a type with a different processing form; pipeline processing can be applied to this type as well, in the same way as in the embodiments above. In the configurations described so far, the input image data was distributed through the shift register 11 so that PE 12 #1 to #4 each received one of the adjacent pixels. In this example, by contrast, the input image data is supplied in common to PE 12 #1 to #4, and the multiplication results are accumulated through the arithmetic circuits 18 and registers 19 to output the partial sum Σ¹. This operation is described with reference to the time chart of FIG. 10.

At time 1, image data f₁₁ is input from the image data input port 20, PE 12 #1 multiplies it by weight w₁₁ read from the weight storage memory 15, and the product f₁₁*w₁₁ is set in register 19 #2.

At time 2, image data f₁₂ is input and PE 12 #2 forms the product f₁₂*w₁₂ with weight w₁₂. The arithmetic circuit 18 adds this to the value f₁₁*w₁₁ of register 19 #2, and the sum f₁₁*w₁₁ + f₁₂*w₁₂ is set in register 19 #3.

At time 3, image data f₁₃ is input and PE 12 #3 forms the product f₁₃*w₁₃ with weight w₁₃. The arithmetic circuit 18 adds this to the value f₁₁*w₁₁ + f₁₂*w₁₂ of register 19 #3, and the sum f₁₁*w₁₁ + f₁₂*w₁₂ + f₁₃*w₁₃ is set in register 19 #4.

At time 4, image data f₁₄ is input and PE 12 #4 forms the product f₁₄*w₁₄ with weight w₁₄. The arithmetic circuit 18 adds this to the value f₁₁*w₁₁ + f₁₂*w₁₂ + f₁₃*w₁₃ of register 19 #4, giving the sum Σ¹₁₁ = f₁₁*w₁₁ + f₁₂*w₁₂ + f₁₃*w₁₃ + f₁₄*w₁₄. These partial sums are accumulated by the arithmetic circuits 14 of the basic modules 10A to 10D, and the final stage outputs $g = \sum_{i,j=1,1}^{4,4} f_{i,j} \ast w_{i,j}$.
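
The four time steps above follow a fixed pattern: every PE sees the same broadcast pixel, and the running sum moves one register to the right each clock. A minimal Python model of that pattern is given here; the function names are hypothetical, and the cross-module accumulation of arithmetic circuits 14 is idealized as a plain sum over rows.

```python
def broadcast_row_sums(pixels, weights):
    """One module of the broadcast type: propagate partial sums through registers."""
    n = len(weights)
    regs = [0] * n  # regs[k] holds the partial sum entering PE #(k+1); regs[0] stays 0
    sums = []
    for t, f in enumerate(pixels):
        # Every PE receives the same pixel f and adds f * w_k to the propagated sum.
        stage = [regs[k] + f * weights[k] for k in range(n)]
        if t >= n - 1:           # the last PE now completes a full n-pixel window
            sums.append(stage[-1])
        regs = [0] + stage[:-1]  # latch: each sum advances to the next PE
    return sums


def type2_filter(image_rows, weights):
    """Accumulate the per-row partial sums of all modules (idealized circuits 14)."""
    per_row = [broadcast_row_sums(row, w) for row, w in zip(image_rows, weights)]
    return [sum(col) for col in zip(*per_row)]
```

Because a new window starts in PE #1 on every clock while older windows are still moving toward PE #4, the model yields one completed row sum per clock once the first window has filled.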

Thereafter, the spatial product-sum result g is output at intervals of the basic clock time Δt4.

For this type of parallel image processing processor 2- as well, P and PS variants are conceivable, as with Type I, and the basic clock time Δt4 can be reduced. Since these can easily be inferred by analogy, they are omitted here.

FIG. 11 shows another embodiment whose processing form differs further. Whereas the schemes described so far gave each PE 12 its own product-sum weights (memory) 15, the configuration of FIG. 11 gives product-sum weights (memory) 25 in common to all PEs 12; this type is called parallel image processing processor 2-. Its operation is described with reference to the time chart of FIG. 12.

First, assume that at time 1 image data f₁₄ has already been input from the image data input port 20. PE 12 #1 to #4 are then supplied with f₁₁, f₁₂, f₁₃, and f₁₄ respectively through the shift register 11. Weight w₁₁ is read from the weight storage memory 15 and multiplied by each of these inputs. In the arithmetic circuits 20, the held value is cleared to "0" at the beginning of time 1, and the products of f₁₁ to f₁₄ with w₁₁ are then held.

At time 2, image data f₁₅ is input, PE 12 #1 to #4 are supplied with f₁₂ to f₁₅ respectively, and the products with the next weight w₁₂ are formed. The arithmetic circuits 20 then accumulate these with the previous values. For example, #1 holds f₁₁*w₁₁ + f₁₂*w₁₂ and #2 holds f₁₂*w₁₁ + f₁₃*w₁₂ as the result.

The same processing is carried out at times 3 and 4, so that the arithmetic circuits 20 #1 to #4 obtain the respective first partial sums

#1: Σ¹₁₁ = f₁₁*w₁₁ + f₁₂*w₁₂ + f₁₃*w₁₃ + f₁₄*w₁₄
#2: Σ¹₁₂ = f₁₂*w₁₁ + f₁₃*w₁₂ + f₁₄*w₁₃ + f₁₅*w₁₄
#3: Σ¹₁₃ = f₁₃*w₁₁ + f₁₄*w₁₂ + f₁₅*w₁₃ + f₁₆*w₁₄
#4: Σ¹₁₄ = f₁₄*w₁₁ + f₁₅*w₁₂ + f₁₆*w₁₃ + f₁₇*w₁₄

and these are set in the shift register 21 at the end of time 4.

At times 5 to 8, Σ¹₁₁ to Σ⁴₁₁, Σ¹₁₂ to Σ⁴₁₂, Σ¹₁₃ to Σ⁴₁₃, and Σ¹₁₄ to Σ⁴₁₄ from the shift registers 21 of the basic modules 10A to 10D are accumulated in turn by the arithmetic circuits 14, and the results g₁₁ to g₁₄ are output. At the same time, the same processing as at times 1 to 4 is carried out on image data f₁₅ to f₁₈ in PE #1, f₁₆ to f₁₉ in PE #2, f₁₇ to f₂₀ in PE #3, and f₁₈ to f₂₁ in PE #4, giving the partial sums Σ¹₁₅, Σ¹₁₆, Σ¹₁₇, and Σ¹₁₈; these are accumulated at times 9 to 12 to give the results g₁₅ to g₁₈. In this way the spatial product-sum results are output continuously.
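
The shared-weight scheme processes windows in blocks: over n clocks each PE builds the row sum for its own window, then the n sums are shifted out while the next block is computed. The sketch below models only the arithmetic, not the overlap in time; the function names are hypothetical and 0-based indices replace the f₁₁-style labels.

```python
def shared_weight_block(pixels, weights, start):
    """Row partial sums for n adjacent windows whose first pixels are start..start+n-1."""
    n = len(weights)
    acc = [0] * n                    # arithmetic circuits 20 #1..#n, cleared to 0
    for t, w in enumerate(weights):  # one shared weight is broadcast per clock
        for k in range(n):           # all PEs multiply in parallel
            acc[k] += pixels[start + k + t] * w
    return acc


def type3_filter(image_rows, weights):
    """Compute results block by block, n window positions at a time."""
    n = len(weights[0])
    width = len(image_rows[0])
    results = []
    # Each block needs pixels start .. start + 2n - 2 of every row.
    for start in range(0, width - 2 * n + 2, n):
        blocks = [shared_weight_block(row, w, start)
                  for row, w in zip(image_rows, weights)]
        # Shift register 21 contents are accumulated across modules (circuits 14).
        results.extend(sum(col) for col in zip(*blocks))
    return results
```

In this model a block of four results needs seven pixels per row, which corresponds to the text's use of f₁₁ through f₁₇ for the first block.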

For this type of parallel image processing processor 2- as well, P and PS variants are conceivable, as with Type I, and the basic clock time Δt5 can be reduced.

According to the present invention, an architecture is obtained that, owing to its excellent expandability, is suited to LSI implementation and also allows the processing speed to be increased.

[Brief explanation of the drawing]

FIG. 1 is a diagram showing the configuration of an image processing system; FIG. 2 is a diagram explaining an example of local parallel processing; FIGS. 3, 9, and 11 are block diagrams of parallel image processing processors to which the present invention applies; FIGS. 5 and 7 are diagrams of embodiments of the parallel image processing processor according to the present invention; and FIGS. 4, 6, 8, 10, and 12 are diagrams showing the time charts of the respective parallel image processing processors.

2... parallel image processing processor, 3... image memory, 10... image processing processor basic module, 11... input image shift register, 12... processor element, 13... partial sum arithmetic circuit, 14... partial sum accumulation arithmetic circuit, 15... weight storage memory, 16... pipeline register, 17... (variable-stage) skew correction shift register, 18... propagation/accumulation arithmetic circuit, 19... propagation register, 20... accumulation arithmetic circuit, 21... partial sum output shift register, 24... image data input port, 25... image data output port, 30... operation result data input port, 35... operation result data output port.

Claims

[Scope of Claims]

1. A parallel image processing processor that takes in image data from an image data supply source and performs local parallel image data processing, characterized in that a plurality of image processing processor basic modules are arranged in parallel, each basic module comprising: an image data input port; a plurality of shift registers that sequentially take in image data from the image data input port; a plurality of processor elements that receive the contents of the respective shift registers and perform image processing operations; a first arithmetic circuit that adds the operation results of the processor elements; an operation result data input port that receives operation result data from the basic module of the preceding stage; a second arithmetic circuit that adds the operation result data to the operation result of the first arithmetic circuit; an operation result data output port that outputs the operation result data of the second arithmetic circuit; and pipeline registers arranged between the shift registers and the processor elements, between the processor elements and the first arithmetic circuit, and between the first arithmetic circuit and the second arithmetic circuit.

2. A parallel image processing processor that takes in image data from an image data supply source and performs local parallel image data processing, characterized in that a plurality of image processing processor basic modules are arranged in parallel, each basic module comprising: an image data input port; shift registers that sequentially take in image data from the image data input port; a plurality of processor elements that receive the contents of the respective shift registers and perform image processing operations; a first arithmetic circuit that adds the operation results of the processor elements; an operation result data input port that receives operation result data from the basic module of the preceding stage; a second arithmetic circuit that adds the operation result data to the operation result of the first arithmetic circuit; an operation result data output port that outputs the operation result data of the second arithmetic circuit; first pipeline registers arranged between the shift registers and the processor elements, between the processor elements and the first arithmetic circuit, and between the first arithmetic circuit and the second arithmetic circuit; and a second pipeline register arranged between the operation result data input port and the second arithmetic circuit.