
JP6800656B2 - Arithmetic circuit, its control method and program

Info

Publication number: JP6800656B2
Application number: JP2016163408A
Authority: JP
Inventors: 加藤　政美; 山本　貴久; 伊藤　嘉則; 野村　修; 森　克彦
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2016-08-24
Filing date: 2016-08-24
Publication date: 2020-12-16
Anticipated expiration: 2036-08-24
Also published as: JP2018032190A

Description

The present invention relates to an arithmetic circuit used for pattern recognition and the like, to a control method therefor, and to a program.

Neural network techniques are widely applied in image processing apparatuses such as pattern recognition apparatuses. Among neural networks, the computation technique known as Convolutional Neural Networks (hereinafter abbreviated as CNN) has attracted attention as a technique that enables pattern recognition robust to variations in the recognition target. For example, Patent Document 1 proposes an application to face recognition using image data.

FIG. 3 is a network configuration diagram showing an example of simple CNN processing. When CNN processing is performed on image data, the input layer 301 corresponds to raster-scanned image data of a predetermined size. Feature planes 303a to 303c are the feature planes of the first layer 308. A feature plane is a data plane corresponding to the result of a predetermined feature extraction operation (a convolution operation followed by non-linear processing). A feature plane corresponds to a feature extraction result used to recognize a predetermined target in a higher layer; because it is the result of processing raster-scanned image data, the result is also represented as a plane. In a CNN, the data groups that make up the many feature planes are related hierarchically through arithmetic processing.

Feature planes 303a to 303c are generated by a convolution operation on the input layer 301 followed by non-linear processing. For example, feature plane 303a is generated by the two-dimensional convolution operation shown schematically as filter kernel 3021a and a non-linear transformation of the result. A convolution operation whose filter kernel (filter coefficient matrix) has size columnSize × rowSize is processed by a product-sum operation as shown in the following equation.
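The equation is not reproduced in this text. One form consistent with the symbol definitions given in the next paragraph, assuming zero-based kernel indices (the summation limits are an assumption and are not taken from the original figure), is:

```latex
\[
\mathrm{output}(x, y) =
\sum_{\mathrm{row}=0}^{\mathrm{rowSize}-1}\;
\sum_{\mathrm{column}=0}^{\mathrm{columnSize}-1}
\mathrm{input}(x + \mathrm{column},\; y + \mathrm{row}) \times
\mathrm{weight}(\mathrm{column}, \mathrm{row})
\]
```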

Here, "input(x, y)" denotes the reference pixel value at coordinates (x, y), and "output(x, y)" denotes the operation result at coordinates (x, y). "weight(column, row)" denotes the weighting coefficient for coordinates (x + column, y + row), and "columnSize" and "rowSize" denote the kernel size.

In CNN processing, the product-sum operation is repeated while a plurality of filter kernels are scanned pixel by pixel, and the final product-sum result is non-linearly transformed to generate a feature plane. Because feature plane 303a is calculated from a single image data plane in the previous layer, its number of connections is 1, and a single kernel, 3021a, is used to calculate it. Kernels 3021b and 3021c are the filter kernels used to calculate feature planes 303b and 303c, respectively. Hereinafter, a filter kernel may be referred to simply as a filter or a kernel, and the size of a filter kernel as the kernel size or filter size.
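The scan described above can be sketched in software as follows. This is a minimal illustration, not the circuit of the invention: the function name `convolve_plane`, the list-of-lists data layout, and the choice to compute only positions where the kernel fits entirely inside the input are all assumptions made for the example.

```python
import math

def convolve_plane(inp, weight, nonlinear=math.tanh):
    """Scan the kernel over `inp` pixel by pixel and apply a non-linear
    transform to each final product-sum result.

    inp    : 2-D list indexed as inp[y][x]
    weight : 2-D list indexed as weight[row][column]
    Border handling is an assumption: only positions where the kernel
    fits entirely inside `inp` are computed.
    """
    row_size, column_size = len(weight), len(weight[0])
    out_h = len(inp) - row_size + 1
    out_w = len(inp[0]) - column_size + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0.0  # product-sum accumulator for one output position
            for row in range(row_size):
                for column in range(column_size):
                    acc += inp[y + row][x + column] * weight[row][column]
            out[y][x] = nonlinear(acc)
    return out
```

Passing an identity function as `nonlinear` exposes the raw product-sum results, which is convenient for checking the arithmetic by hand.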

FIG. 4 shows an example of calculating feature plane 305a in CNN processing. Feature plane 305a is calculated from, and is therefore connected to, the three feature planes 303a to 303c of the previous layer 308. To calculate the data of feature plane 305a, a convolution operation using kernel 3041a (shown schematically) is first performed on feature plane 303a, and the result is held in the cumulative adder 401. Similarly, convolution operations using kernels 3042a and 3043a are performed on feature planes 303b and 303c, respectively, and the results are accumulated in the cumulative adder 401.

After the convolution operations with the three kernels are complete, non-linear conversion processing 402 using a logistic function or a hyperbolic tangent function (tanh function) is performed. Feature plane 305a is generated by performing the above processing while scanning the entire image one pixel at a time. In the same way as in FIG. 4, feature plane 305b is calculated by applying the convolution operations of kernels 3041b, 3042b, and 3043b to the three feature planes of the previous layer 308. Likewise, feature plane 307 is calculated by applying the convolution operations of kernels 3061 and 3062 to feature planes 305a and 305b of the previous layer 309.
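The accumulation over several reference planes followed by a single non-linear conversion can be sketched as below. The function name `feature_plane` and the data layout are hypothetical; the variable `acc` stands in for the cumulative adder 401, and the final call to `nonlinear` stands in for the non-linear conversion processing 402. Border handling (kernel fully inside the plane) is again an assumption.

```python
import math

def feature_plane(ref_planes, kernels, nonlinear=math.tanh):
    """Compute one feature plane from several reference planes (cf. FIG. 4).

    ref_planes : list of 2-D lists indexed [y][x], all the same size
    kernels    : one 2-D kernel per reference plane, all the same size
    """
    row_size, column_size = len(kernels[0]), len(kernels[0][0])
    out_h = len(ref_planes[0]) - row_size + 1
    out_w = len(ref_planes[0][0]) - column_size + 1
    out = []
    for y in range(out_h):
        line = []
        for x in range(out_w):
            acc = 0.0  # plays the role of the cumulative adder 401
            for plane, kernel in zip(ref_planes, kernels):
                for row in range(row_size):
                    for column in range(column_size):
                        acc += plane[y + row][x + column] * kernel[row][column]
            line.append(nonlinear(acc))  # non-linear conversion 402
        out.append(line)
    return out
```

Note that the non-linear conversion is applied once per output position, after all reference planes have been accumulated, not once per kernel.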

The coefficients of each kernel are assumed to have been determined in advance by learning with a general technique such as perceptron learning or backpropagation learning. In pattern recognition and similar applications, convolution operations may use large kernels of 10 × 10 or more.

Because CNN processing repeats the convolution operations of many kernels in this way, an enormous number of product-sum operations is required.
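To give a sense of scale, the number of multiply-accumulate operations for one feature plane can be estimated as the product of the output size, the kernel size, and the number of connected reference planes. The function name and the sizes below are hypothetical and chosen only for illustration.

```python
def mac_count(out_width, out_height, kernel_width, kernel_height, num_ref_planes):
    """Multiply-accumulate operations needed to compute one feature plane."""
    return out_width * out_height * kernel_width * kernel_height * num_ref_planes

# Hypothetical sizes: a 640 x 480 output, an 11 x 11 kernel, 3 reference planes.
print(mac_count(640, 480, 11, 11, 3))  # 111513600
```

Even this single feature plane needs more than one hundred million product-sum operations, and a full network computes many such planes.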

To speed up the convolution operation, Patent Document 2, for example, proposes an apparatus that sets a common weighting coefficient in a plurality of product-sum operation units and executes the convolution operation at high speed by operating in parallel while shifting the input data.

Patent Document 3 discloses a technique that speeds up the overall processing by using skin color information to limit the face candidate regions on which face detection processing is performed.

Further, Patent Document 4 proposes a technique for controlling the number of arithmetic units that operate in parallel in a parallel arithmetic apparatus that executes error correction processing.

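Patent Document 1: Japanese Patent Application Laid-Open No. 10-021406
Patent Document 2: Japanese Patent Application Laid-Open No. 2010-134697
Patent Document 3: Japanese Patent Application Laid-Open No. 2005-242582
Patent Document 4: WO 00/079405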

In Patent Document 1, however, when convolution operations are processed in parallel, the data transfer that supplies the previous layer's data to the arithmetic units as reference data may become a bottleneck depending on, for example, the kernel size. Likewise, when a parallel arithmetic processing apparatus such as that disclosed in Patent Document 2 is combined with a low-speed memory, the performance may fall short of what the number of arithmetic units operating in parallel (the degree of parallelism) would allow. There are other cases in which performance commensurate with the degree of parallelism of the arithmetic circuit cannot be obtained. Taking the data transfer bottleneck as an example, data is not transferred at the rate that the arithmetic units can process when operating simultaneously, so the power consumed by the arithmetic units is wasted. This is a particular problem when the data transfer for the convolution operation becomes a bottleneck because of the kernel size used in CNN processing.

Combining a parallel arithmetic processing apparatus such as that disclosed in Patent Document 2 with the processing region limitation disclosed in Patent Document 3 makes it possible to perform CNN processing at high speed. However, although the size of the processing target region varies under the method of Patent Document 3, a parallel arithmetic processing apparatus such as that of Patent Document 2 executes operations at a uniform degree of parallelism regardless of the processing target region, which can waste power.

The present invention has been made in view of the above problems. Its object is to provide an arithmetic circuit that reduces power consumption, in cases such as when data transfer becomes a bottleneck, by appropriately controlling how many of a plurality of multipliers capable of parallel execution are made to operate. A further object is to provide a control method and a program for the arithmetic circuit.

To solve the above problems, an arithmetic circuit according to the present invention has the following configuration. The arithmetic circuit is connected to a holding device that holds reference data for filter operation processing and coefficient data of a filter used in the filter operation processing, and comprises: a plurality of multipliers that execute the filter operation processing by repeatedly multiplying mutually different reference data by common coefficient data; first data supply means for supplying the mutually different reference data held in the holding device to the plurality of multipliers; second data supply means for supplying the common coefficient data held in the holding device to the plurality of multipliers; and control means, wherein the time for which the multipliers repeatedly execute the multiplication increases as the filter size of the filter operation processing increases, and wherein, when the filter size is equal to or less than a predetermined value, the control means performs control such that some of the plurality of multipliers execute the multiplication and the other multipliers do not execute the multiplication.
An arithmetic circuit according to another aspect of the present invention comprises:
a plurality of multipliers that execute filter operation processing; and
control means for performing control, based on the filter size in the filter operation processing, such that some of the plurality of multipliers execute the filter operation processing and the other multipliers do not execute the filter operation processing.
An arithmetic circuit according to yet another aspect of the present invention comprises:
a plurality of multipliers that execute filter operation processing; and
control means for performing control, based on the data size of the reference data of the filter operation processing, such that some of the plurality of multipliers execute the filter operation processing and the other multipliers do not execute the filter operation processing.
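The control described in the first configuration can be sketched as a simple selection rule. This is a hedged illustration only: the function name, the parameters `threshold` and `reduced_count`, and the choice to enable the lowest-numbered multipliers are all hypothetical, since the text here states only that some multipliers operate and the others do not when the filter size is at or below a predetermined value.

```python
def select_active_multipliers(num_multipliers, filter_size, threshold, reduced_count):
    """Return one enable flag per multiplier.

    If filter_size <= threshold, only the first `reduced_count` multipliers
    are enabled; otherwise all of them are. `threshold` and `reduced_count`
    are hypothetical parameters, not values taken from the text.
    """
    if not 0 < reduced_count <= num_multipliers:
        raise ValueError("reduced_count must be in 1..num_multipliers")
    active = reduced_count if filter_size <= threshold else num_multipliers
    return [i < active for i in range(num_multipliers)]
```

For example, with eight multipliers, a threshold of 5, and a reduced count of 4, a 3 × 3 filter would enable four multipliers while an 11 × 11 filter would enable all eight.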

According to the present invention, in cases such as when data transfer becomes a bottleneck, power consumption can be reduced by controlling how many of the plurality of multipliers capable of parallel execution in the arithmetic circuit are made to operate.

A block diagram showing a configuration example of an image processing apparatus provided with the arithmetic circuit according to the first embodiment.
A diagram showing the configuration of the arithmetic circuit 22.
A network configuration diagram showing an example of CNN processing.
A diagram showing an example of generating feature plane 305a in CNN processing.
A diagram showing the configuration of the control unit 501.
A diagram illustrating an example of the information set in the register group 602.
A diagram illustrating a configuration example of the shift register.
A diagram illustrating the configuration of a multiplier.
A diagram illustrating the configuration of a cumulative adder.
A diagram illustrating the configuration of the non-linear conversion processing unit 509.
A diagram illustrating an example of a convolution operation by the arithmetic circuit 22.
A time chart illustrating the operation of a convolution operation by the arithmetic circuit 22.
A diagram illustrating the relationship between processing time and the degree of parallelism and kernel size.
A flowchart illustrating the operation of the image processing apparatus.
A diagram schematically illustrating a processing example of the image processing apparatus of the second embodiment.
A flowchart illustrating the operation of the image processing apparatus of the second embodiment.
(a) A diagram showing conventional parallel operation. (b) A diagram illustrating a specific example of the method of determining the degree of parallelism in the second embodiment.
A diagram illustrating an example of the parallelism determination table of the second embodiment.
A diagram illustrating an application example of the third embodiment.
A diagram showing an example of a conventional parallel arithmetic circuit.

Embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

(First Embodiment)
First, a first embodiment of the present invention will be described. FIG. 1 shows a configuration example of an image processing apparatus provided with a parallel arithmetic circuit according to the first embodiment of the present invention. The image processing apparatus has a function of recognizing or detecting a specific object (image pattern) in input image data. The image input module 20 comprises an optical system, a photoelectric conversion device such as a CCD or CMOS sensor, a driver circuit that controls the sensor, an AD converter, a signal processing circuit that performs various image corrections, a frame buffer, and the like.

The RAM (Random Access Memory) 500 is used as an image buffer and as a working buffer for the arithmetic circuit 22. The RAM 500 holds data groups corresponding to CNN feature planes, filter coefficient data, and the like. The RAM 500 is connected to the arithmetic circuit 22 and serves as an external data holding device that transfers the data the arithmetic circuit 22 requires. In this embodiment, the arithmetic circuit 22 is a CNN processing unit that performs CNN processing in parallel. The configuration of the arithmetic circuit 22 is described later.

The DMAC (Direct Memory Access Controller) 26 handles data transfer between the modules and circuits on the image bus 23 and the CPU bus 30. The bridge 24 provides a bridge function between the image bus 23 and the CPU bus 30. The preprocessing module 25 performs various kinds of preprocessing so that pattern recognition by CNN processing can be carried out effectively. Specifically, it performs image data conversion such as color conversion and contrast correction in hardware.

The face candidate detection module 31 identifies the face candidate region to be processed by the arithmetic circuit 22. Specifically, it identifies a human skin color region within the predetermined color space produced by the preprocessing module 25 and sets that region as the processing target region for CNN processing in the arithmetic circuit 22. Information about the processing target region identified here is recorded in the RAM 500 and used by the arithmetic circuit 22.

The CPU 27 controls the operation of the entire image processing apparatus. The ROM (Read Only Memory) 28 stores the instructions that define the operation of the CPU 27 and the parameter data needed for various operations. The RAM 29 is the memory the CPU 27 needs for its operation. The CPU 27 can also access the RAM 500 on the image bus 23 via the bridge 24.

FIG. 2 shows the configuration of the arithmetic circuit 22. In this embodiment, the arithmetic circuit 22 is described as a CNN processing unit that performs CNN processing, but the arithmetic processing it performs is not limited to CNN processing and can be applied to various other kinds of filter operation processing. The arithmetic circuit 22 is applicable to various kinds of processing expressed as hierarchical connections of arithmetic operations, for example other hierarchical processing such as Restricted Boltzmann Machines and Recursive Neural Networks.

The operations of the arithmetic circuit 22 can be applied to a variety of cases, from a single layer to multiple layers as shown in FIG. 3. The data on which the arithmetic circuit 22 operates is not limited to two-dimensional data; one-dimensional data and data of three or more dimensions can also be handled. The CNN processing performed by the arithmetic circuit 22 shown in FIG. 2 calculates feature planes sequentially from the lower layers, following the hierarchical connections among a plurality of data groups as shown in FIG. 3. That is, the arithmetic circuit 22 first calculates feature planes 303a, 303b, and 303c in order, using the input image data 301 as reference data. Next, it calculates feature planes 305a and 305b in order, using feature planes 303a to 303c as reference data. In this way, the arithmetic circuit 22 calculates feature planes in order, whatever the number of layers, and finally calculates feature plane 307.

The operation of the arithmetic circuit 22 in this embodiment is described below for the case where convolution operations are processed in parallel in the horizontal direction. That is, taking the feature plane that results from the convolution operation as the reference, a plurality of arithmetic units operate simultaneously to execute the convolution operations for a plurality of horizontally consecutive positions. In this embodiment, the number of arithmetic units (multiplier and cumulative adder pairs) that operate simultaneously to execute convolution operations is called the degree of parallelism. Parallel operation in the vertical direction can be performed in the same way as in the horizontal direction, so its description is omitted.
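Horizontal parallel processing can be modelled in software as follows. This is a sketch under stated assumptions, not a description of the circuit: the function name `parallel_row_outputs` and the data layout are hypothetical, and the innermost loop over `unit` represents work that the hardware arithmetic units would perform simultaneously.

```python
def parallel_row_outputs(inp, weight, x0, y, parallelism):
    """Compute `parallelism` horizontally adjacent outputs starting at (x0, y).

    Each (row, column) step models one cycle in which every arithmetic unit
    multiplies the same weight by a different reference value and adds the
    product to its own accumulator.
    """
    acc = [0.0] * parallelism  # one cumulative adder per arithmetic unit
    for row in range(len(weight)):
        for column in range(len(weight[0])):
            w = weight[row][column]  # coefficient shared by all units
            for unit in range(parallelism):  # concurrent in hardware
                acc[unit] += inp[y + row][x0 + unit + column] * w
    return acc
```

The number of (row, column) steps equals the kernel size, while the amount of reference data consumed per step grows with `parallelism`, which is why the balance between kernel size and data transfer matters.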

As shown in FIG. 5, the control unit 501 of FIG. 2 comprises a register group 602 that determines the basic operation of the arithmetic circuit 22, a sequencer 601 that controls the timing of various signals based on the values of the register group 602, a memory control unit 605 that arbitrates access to the RAM 500, and the like.

The arithmetic unit control unit 517 generates signals that control the operation of multipliers 507a to 507n and cumulative adders 508a to 508n in accordance with parallelism designation information held in an internal register. Control signals 513a to 513n directly control the operation of multipliers 507a to 507n, respectively. The arithmetic unit control unit 517 comprises a mask register 511 that latches the arithmetic unit control signals and an arithmetic unit control data generation unit 512 that controls the mask register 511 at predetermined timing. The arithmetic unit control data generation unit 512 is also connected to the register group 602 of the control unit 501.
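One way to picture the mask register is as a bit field with one enable bit per arithmetic unit. The sketch below is an assumption made for illustration: the bit assignment (bit 0 for unit "a"), the convention that 1 means enabled, and both function names are hypothetical and are not specified in the text.

```python
def mask_register_value(parallelism, num_units):
    """Return a mask whose lowest `parallelism` bits are 1 (unit enabled)."""
    if not 0 <= parallelism <= num_units:
        raise ValueError("parallelism must be in 0..num_units")
    return (1 << parallelism) - 1

def control_signals(mask, num_units):
    """Expand the mask into one control bit per unit, unit 'a' first."""
    return [(mask >> i) & 1 for i in range(num_units)]
```

With eight units and a degree of parallelism of 3, the mask would be `0b00000111`, enabling only the first three units.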

FIG. 5 illustrates the configuration of the control unit 501. The sequence control unit 601 inputs and outputs various control signals 604 that control the operation of the arithmetic circuit 22 in accordance with the information set in the register group 602. It likewise generates a control signal 606 for controlling the memory control unit 605. The sequence control unit 601 is implemented as a sequencer built from binary counters, Johnson counters, and the like. The register group 602 consists of a plurality of register sets (not shown) and holds information such as that needed for hierarchical CNN processing.

FIG. 6 shows an example of the information set in the register sets of the register group 602 shown in FIG. 5. Register contents 1101a, 1101b, and 1101c are examples of such information; each holds the information needed to calculate one feature plane. Predetermined values are written to the register group 602 in advance by the CPU 27 via the bridge 24 and the image bus 23.

Here, each register in a register set is assumed to be 32 bits wide. In register contents 1101a, "final layer designation" is a register value that specifies whether the feature plane corresponding to the register set belongs to the final layer. When this value is 1, the feature plane to be calculated is a final-layer feature plane, and CNN processing ends once all calculation of the final-layer feature planes is complete. "Number of reference data planes" specifies the number of feature planes in the previous layer that are connected to the feature plane to be calculated; for example, it is set to 3 when calculating feature plane 305a shown in FIG. 3. "Non-linear conversion" specifies whether non-linear conversion processing is performed; when it is set to 1, the non-linear conversion processing is executed. "Calculation result storage destination pointer" is the address of the start pointer in the RAM 500 at which the calculation results for the target feature plane are held; the results are stored in raster-scan order starting from that pointer value.

"Horizontal size of the kernel" and "vertical size of the kernel" are register values that specify the size of the kernel used in the convolution operation for the feature plane. "Degree of parallelism" is a register value that specifies the number of arithmetic units that operate in parallel when the convolution operation for the feature plane is executed. The degree of parallelism is set in advance according to factors such as the kernel size and the data transfer capability of the RAM 500. The relationship between the kernel size (filter size) and the degree of parallelism is described later.

"Weight coefficient storage destination" is a register value indicating the address in the RAM 500 at which the weighting coefficients of the kernels used to calculate the feature plane are stored. The weighting coefficient data consists of as many coefficient sets as the "number of reference data planes" and is assumed to be stored in raster-scan order starting from the address specified by the "weight coefficient storage destination" register value. That is, "horizontal size of the kernel" × "vertical size of the kernel" × "number of reference data planes" coefficient values are stored in the RAM 500. The "horizontal size of the reference data" and "vertical size of the reference data" register values indicate, respectively, the number of pixels in the horizontal direction and the number of lines in the vertical direction of the reference image data or reference feature plane.

The reference data is assumed to be stored in the RAM 500 in raster-scan order, starting from the address indicated by the "reference data storage destination pointer" register value. That is, "horizontal size of the reference data" × "vertical size of the reference data" × "number of reference data planes" reference data values are stored in the RAM 500. The register values described above are provided for each feature plane to be calculated. In this embodiment, when the "reference data storage destination pointer" of the feature plane being processed equals the "calculation result storage destination pointer" of a feature plane in the previous layer, the referenced previous-layer feature plane and the feature plane being calculated are connected.
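The register set described above can be summarized as a record. The following is a minimal sketch, assuming hypothetical field names (the text gives only the descriptive register names) and modelling addresses as plain integers; it also shows the data counts and the pointer-equality rule for connection stated above.

```python
from dataclasses import dataclass

@dataclass
class RegisterSet:
    """One register set: the information needed to compute one feature plane.
    All field names are hypothetical labels for the registers in the text."""
    final_layer: int      # 1 if this feature plane belongs to the final layer
    num_ref_planes: int   # number of connected previous-layer feature planes
    nonlinear: int        # 1 to apply non-linear conversion
    result_ptr: int       # calculation result storage destination pointer
    kernel_width: int     # horizontal size of the kernel
    kernel_height: int    # vertical size of the kernel
    parallelism: int      # degree of parallelism
    weight_ptr: int       # weight coefficient storage destination
    ref_width: int        # horizontal size of the reference data
    ref_height: int       # vertical size of the reference data
    ref_ptr: int          # reference data storage destination pointer

    def num_coefficients(self):
        """Coefficient values stored in RAM for this feature plane."""
        return self.kernel_width * self.kernel_height * self.num_ref_planes

    def num_reference_values(self):
        """Reference data values stored in RAM for this feature plane."""
        return self.ref_width * self.ref_height * self.num_ref_planes

def is_connected(current, previous):
    """True when `current` reads its reference data from where `previous`
    wrote its results, i.e. the two feature planes are connected."""
    return current.ref_ptr == previous.result_ptr
```

For instance, a register set with an 11 × 11 kernel and three reference planes would account for 363 coefficient values.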

The sequence control unit 601 performs sequence control of the arithmetic operation according to register values such as "horizontal size of kernel", "vertical size of kernel", "horizontal size of reference data", "vertical size of reference data", and "degree of parallelism". The memory control unit 605 arbitrates accesses by the data buses 607 to 609 for reading from and writing to the RAM 500, according to the control signal 606 generated by the sequence control unit 601.

Specifically, the memory control unit 605 controls access to the memory via the image bus 603, reading of the reference data 607, reading of the weight coefficient data 608, and writing of the calculation result data 609. Access to the RAM 500 is described later. The data width of the RAM 500 and of each of the data buses 607 to 609 is assumed to be 32 bits.

The storage units 502 and 503 in FIG. 2 are each composed of, for example, a plurality of registers or memories. The storage unit 502 temporarily holds the kernel weight coefficient data stored in the RAM 500. When the weight coefficients are 8-bit data, the storage unit 502 is composed of a plurality of 8-bit-wide registers.

The storage unit 502 also has as many registers as the kernel size in the direction in which the convolution operations are processed in parallel. For example, when the convolution operations are processed in parallel in the horizontal direction and the horizontal kernel size is 11, the storage unit 502 has 11 registers. In practice, since multiple kernel sizes are used, the storage unit 502 is given as many registers as the largest kernel size expected. During the shift operation of the shift register 504, the control unit 501 loads the kernel weight coefficients required for the product-sum operation of the next row from the RAM 500 into the registers of the storage unit 502.

The storage unit 503 temporarily holds the reference data stored in the RAM 500. For example, when the reference data is 8-bit data, the storage unit 503 is composed of a plurality of 8-bit-wide registers. The storage unit 503 has at least "number of data items that can be processed in parallel" + "kernel size in the parallel processing direction − 1" registers. Here, the "number of data items that can be processed in parallel" is the maximum degree of parallelism of the arithmetic circuit 22. This register count is the number of reference data items needed to calculate feature plane data at multiple positions at once (in parallel); any number of registers equal to or greater than that is sufficient. For example, when the convolution operations are processed in parallel in the horizontal direction with a kernel size of 11 and a degree of parallelism of 8, the storage unit 503 is composed of 18 or more 8-bit registers.
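
The register count can be checked by enumerating which reference positions the parallel arithmetic units touch. The sketch below is illustrative only; the function name is hypothetical, and it assumes that unit p needs the reference items p through p + kernel size − 1.

```python
def referenced_indices(parallelism, kernel_size):
    """Collect every reference index used by any arithmetic unit in one row."""
    needed = set()
    for unit in range(parallelism):      # one output position per unit
        for tap in range(kernel_size):   # one kernel coefficient per clock
            needed.add(unit + tap)
    return sorted(needed)
```

For a degree of parallelism of 8 and a kernel size of 11, the neighbouring windows overlap and only 18 distinct items are needed, rather than 8 × 11.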

During the shift operation of the shift register 505, the control unit 501 loads the reference data required for processing the next row from the RAM 500 into the storage unit 503. That is, the product-sum operation of the convolution and the data load from the RAM 500 are pipelined in units of kernel rows. The number of reference data items actually required is determined by the contents of the "degree of parallelism" register.

The shift registers 504, 505, and 506 are shift registers with a data load function. The shift registers 504 and 505 are each composed of a plurality of registers with the same bit width as the storage units 502 and 503, respectively, and the shift register 506 is composed of a plurality of registers whose bit width equals the number of effective bits of the cumulative adder output. The mask register 511 is composed of registers with a bit width equal to the number of multipliers 507a to 507n.

The shift register 504 is a data supply unit that, by its shift operation, sequentially supplies common parameter data (weight coefficients) to the multipliers 507a to 507n. The shift register 505 is a data supply unit that, by its shift operation, supplies reference data at different positions in the previous layer to the multipliers 507a to 507n in parallel.

Since the shift registers 504, 505, and 506 share the same basic configuration, FIG. 7 shows a configuration example for them. FIG. 7 illustrates the case of four registers. The flip-flops 701a to 701d are multi-bit flip-flops that latch data of a predetermined number of bits in synchronization with the CLOCK signal. When the Load signal, which is the selection signal, is 0, the selectors 702a to 702c select the output signals OUTx (x: 0 to 2) of the flip-flops 701a to 701d; when it is 1, they select the input signals INx (x: 1 to 3) of the flip-flops 701a to 701d. That is, the value of the Load signal selects between the shift operation and the load operation of the flip-flops 701a to 701d. The Enable signal is an enable signal for data transitions: when it is 1, data is latched at the rising edge of the CLOCK signal, and when it is 0, the data latched at the previous clock is held unchanged (no state transition occurs).
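
The behaviour of such a loadable shift register can be modelled in a few lines. This is a behavioural sketch under stated assumptions, not a description of FIG. 7 itself: the class name is hypothetical, and because only three selectors serve four flip-flops, the first stage is assumed to latch IN0 in both modes.

```python
class LoadableShiftRegister:
    """Behavioural model of a shift register with a parallel load function."""

    def __init__(self, size=4):
        self.size = size
        self.regs = [0] * size  # regs[0] is the first stage, regs[-1] the last

    def clock(self, load, enable, inputs):
        """Simulate one rising CLOCK edge and return the final-stage output."""
        if len(inputs) != self.size:
            raise ValueError("one input per flip-flop is required")
        if enable:
            if load:
                # Load = 1: every stage latches its own input INx.
                self.regs = list(inputs)
            else:
                # Load = 0: each later stage latches the previous stage's
                # output; the first stage is assumed to latch IN0.
                self.regs = [inputs[0]] + self.regs[:-1]
        # Enable = 0: all stages hold their previous values.
        return self.regs[-1]
```

After a load of four values, each subsequent shift clock presents the next value at the final stage, which is the output that feeds the multipliers in the case of the shift register 504.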

The Load2, Load4, and Load5 signals in FIG. 2 each correspond to the Load signal in FIG. 7, and the Enable1, Enable2, and Enable3 signals in FIG. 2 each correspond to the Enable signal in FIG. 7.

After loading the initial weight coefficient data from the storage unit 502 in a single operation, the shift register 504 performs a shift operation for the same number of clocks as the horizontal kernel size and supplies the weight coefficient data to the multipliers 507a to 507n in succession. The OUTn signal, which is the output signal of the shift register 504, is output from the final stage of the shift register and is input to all of the multipliers 507a to 507n.

Similarly, after loading the initial reference data from the storage unit 503, the shift register 505 performs a shift operation for the same number of clocks as the horizontal kernel size and supplies a plurality of different reference data items to the multipliers 507a to 507n simultaneously. The shift registers 504 and 505 operate in synchronization.

At this timing, the values obtained by multiplying the reference data by the weight coefficients are sent to the cumulative adders. Shifting for the number of clocks corresponding to one row of the kernel processes, in parallel, the convolution operations for one kernel row at different feature plane positions. Repeating this operation for the number of kernel rows completes the two-dimensional convolution operations for as many feature plane positions as the degree of parallelism. Each unit operates according to the control signals output by the control unit 501. The multipliers 507a to 507n are a plurality of multipliers operating in parallel, and the cumulative adders 508a to 508n are a plurality of cumulative adders operating in parallel. The multipliers 507a to 507n and the cumulative adders 508a to 508n are connected one to one and are collectively referred to as arithmetic units.
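
The data flow described above, in which one weight is broadcast per clock while each arithmetic unit receives its own shifted reference item, can be sketched as below. This is a simplified functional model with hypothetical function names; it ignores the bit widths and pipeline timing of the actual circuit and applies the kernel without flipping, as is usual for CNNs.

```python
def parallel_row_mac(weights, ref, parallelism, acc=None):
    """Accumulate one kernel row for `parallelism` output positions."""
    k = len(weights)
    if len(ref) < parallelism + k - 1:
        raise ValueError("not enough reference data for this parallelism")
    acc = list(acc) if acc is not None else [0] * parallelism
    for clk in range(k):                   # one clock per kernel coefficient
        w = weights[clk]                   # common weight (shift register 504)
        for unit in range(parallelism):    # concurrent in hardware
            acc[unit] += w * ref[unit + clk]  # per-unit data (shift register 505)
    return acc


def parallel_conv2d(kernel, ref_rows, parallelism):
    """Repeat the row operation for every kernel row (2-D convolution)."""
    acc = [0] * parallelism
    for w_row, r_row in zip(kernel, ref_rows):
        acc = parallel_row_mac(w_row, r_row, parallelism, acc)
    return acc
```

Each entry of the returned list corresponds to the cumulative sum held by one cumulative adder after the last kernel row.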

FIG. 8 shows a configuration example of the multipliers 507a to 507n. The multiplier of this embodiment is composed of a general multiplier 1401 and a selector 1402. When the arithmetic control signal 1403 is not active, the selector 1402 selects the input value 0 instead of Input2. When the input value 0 is selected, the signal values inside the multiplier logic do not transition regardless of the value of Input1, so current consumption does not increase due to signal transitions. The arithmetic control signal 1403 is one of the control signals 513a to 513n output from the mask register 511 in FIG. 2. Input2 may be either the reference data input or the coefficient data input. Although the method described here masks the input data to the multiplier 1401 with the output of the mask register 511 (the arithmetic control signal 1403), an alternative such as setting the input data to 0 in the storage unit 503 may be used instead.

As shown in FIG. 9, each of the cumulative adders 508a to 508n is composed of an adder 901 and a register 902, and holds the cumulative sum of its input data according to the LatchEnable signal. The LatchEnable signal is synchronized with a clock signal (not shown). When the multiplication result output by the multiplier is stuck at 0 (that is, when the multiplier is stopped), no signal transitions occur in the cumulative adder 508 either, so, as with the multiplier, current consumption does not increase. In this way, stopping the multipliers and cumulative adders with the control signals 513a to 513n output by the mask register 511 suppresses the increase in their current consumption. That is, the degree of parallelism of the arithmetic circuit of this embodiment is controlled by the control signals 513a to 513n output from the mask register 511. The degree of parallelism of the arithmetic circuit is the number of arithmetic units (multipliers or cumulative adders) operating simultaneously. When the degree of parallelism is low, fewer arithmetic units operate, so power consumption can be kept low.
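
The masking scheme can be illustrated with a small functional model. The names below are hypothetical, and the sketch models only the logical behaviour (a masked unit multiplies by 0 and its accumulator never changes), not the power savings themselves.

```python
def masked_multiplier(input1, input2, active):
    """Selector 1402 forces Input2 to 0 when the control signal is inactive."""
    selected = input2 if active else 0
    return input1 * selected


class CumulativeAdder:
    """Adder plus register; accumulates only while latch_enable is set."""

    def __init__(self):
        self.register = 0

    def clear(self):
        self.register = 0

    def clock(self, value, latch_enable=True):
        if latch_enable:
            self.register += value
        return self.register


def degree_of_parallelism(mask_bits):
    """Number of arithmetic units enabled by the mask register outputs."""
    return sum(1 for bit in mask_bits if bit)
```

For a mask of `[1, 1, 1, 1, 0, 0, 0, 0]`, four of eight units are active; the remaining units see a constant 0 at the multiplier output.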

After the computation for each kernel corresponding to the target feature plane is finished, the cumulative sums obtained here are loaded into the shift register 506 and sent to the nonlinear conversion processing unit 509 at a predetermined timing. The shift register 506 can hold the outputs of the plurality of cumulative adders 508a to 508n. Only the predetermined effective bits of the outputs of the cumulative adders 508a to 508n are connected to the shift register 506.

As a method of controlling the degree of parallelism of the arithmetic circuit, the case of forcing the multiplier input data to 0 together with controlling the reference data and output data has been described, but the method is not limited to this. For example, instead of forcing the multiplier input data to 0, the operation of the corresponding registers and arithmetic units may be controlled. This can be realized, for example, by stopping the operating clock of each arithmetic unit. In that case, an AND element is inserted in the clock line of each of the multipliers 507a to 507n and cumulative adders 508a to 508n, and a control signal is connected to it. That is, the degree of parallelism of the arithmetic processing is controlled by controlling the clock supply to each arithmetic unit. Alternatively, the power supplied to the multipliers 507a to 507n and cumulative adders 508a to 508n may be controlled.

FIG. 10 shows the configuration of the nonlinear conversion processing unit 509. The nonlinear conversion processing unit 509 includes a nonlinear conversion processor 1301, implemented as a lookup table, and a selector 1302. The nonlinear conversion processor 1301 takes the product-sum result output by the cumulative adder, supplied at the input In, as address data and references the data held in a ROM or the like. The nonlinear relationship giving the output Out for each address value is recorded in the ROM in advance. When nonlinear conversion is not performed, the selector 1302 outputs the product-sum result from the cumulative adder unchanged.
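
A lookup-table conversion with a bypass path can be sketched as follows. The table contents, the 8-bit address width, and the use of the low-order bits of the product-sum result as the address are all assumptions made for illustration; the text does not specify them.

```python
ADDRESS_BITS = 8  # assumed lookup-table address width


def make_rom(transfer):
    """Precompute Out for every address value, as a ROM would hold it."""
    return [transfer(addr) for addr in range(1 << ADDRESS_BITS)]


def relu8(addr):
    """Example table: treat the address as signed 8-bit and clamp negatives."""
    return addr if addr < 128 else 0


def nonlinear_unit(mac_result, rom, select_nonlinear):
    """Model of processor 1301 (table lookup) and selector 1302 (bypass)."""
    if not select_nonlinear:
        return mac_result                       # pass the sum through as is
    address = mac_result & ((1 << ADDRESS_BITS) - 1)
    return rom[address]
```

With `rom = make_rom(relu8)`, a negative product-sum result maps to 0 when conversion is selected and is passed through unchanged otherwise.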

The selection signal Select of the selector 1302 is connected to the control unit 501, and the selector is controlled according to the "nonlinear conversion" register value in the control unit 501. The converted data is stored at a predetermined address in the RAM 500. This storage address is likewise controlled according to the settings of the register group 602 of the control unit 501 and the operation of the sequence control unit 601. As described above, the control unit 501 processes hierarchical convolution operations in parallel by controlling each timing signal and data transfer according to the contents of the register group 602.

FIG. 11 illustrates an example of parallel processing of convolution operations by the arithmetic circuit 22 of this embodiment. FIG. 11 shows raster-scanned data coordinates. Each block (the smallest square shown schematically) of the reference data plane 1004 processed in parallel represents a pixel of the input image, or of the calculation result of the previous layer, stored in the RAM 500 in raster scan order. Each pixel of the reference data plane 1004 is denoted input(x, y) in the coordinates 1001, where x is the horizontal position and y is the vertical position. Each block of the feature plane 1003 to be calculated by parallel processing represents a pixel of the calculation result in raster scan order. Each pixel of the feature plane 1003 to be calculated is denoted output(x, y) in the coordinates 1002, where x is the horizontal position and y is the vertical position.

The feature plane region 1003 is the region of feature plane data that the arithmetic units of the arithmetic circuit 22 in FIG. 2 calculate simultaneously by convolution. Each pixel in the feature plane region 1003 is denoted output(x, 6) in the coordinates 1002, where x ranges from 5 to 12. In the example shown in FIG. 11, the convolution operations for eight pixel positions of interest on the feature plane are processed simultaneously. At each pixel position of interest, the multipliers 507a to 507n and cumulative adders 508a to 508n corresponding to arithmetic unit numbers 0 to 7 perform the computation. The reference data plane region 1004 processed in parallel is the region of reference pixel data needed to calculate the feature plane region 1003. The example in FIG. 11 assumes a kernel size of 11 in the horizontal direction and 13 in the vertical direction. When the degree of parallelism of the arithmetic circuit 22 is 8, the width covered simultaneously in the horizontal direction is 18 (8 + 11 − 1), so the reference data plane region 1004 measures 18 in the horizontal direction and 13 in the vertical direction. The arithmetic circuit 22 processes the convolution operations over the reference data plane region 1004 simultaneously and thereby calculates the feature plane region 1003 at once. In this way, two-dimensional convolution operations are executed in parallel while the reference data plane is scanned in units of 8 pixels horizontally and 1 line vertically. This embodiment is not limited to calculating in parallel a plurality of feature plane data items arranged in the horizontal direction; feature plane data items that are contiguous in the vertical direction may be calculated in parallel instead. In that case, the weight coefficients of one column of the kernel are loaded into the storage unit 502, and "degree of parallelism + vertical kernel size − 1" horizontally contiguous reference data items are loaded into the storage unit 503. The filter size is the vertical or horizontal size of the kernel, depending on the direction of parallel computation. FIG. 11 has been described using the vertical kernel size as an example.
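
The relationship between the feature plane region and the reference data plane region can be reproduced with a short calculation. The function name is hypothetical, and the sketch assumes odd kernel sizes with the kernel centred on each pixel of interest, which is consistent with output(5..12, 6) mapping onto a reference region starting at the origin.

```python
def reference_region(out_x0, out_y, parallelism, kernel_w, kernel_h):
    """Return (x_first, x_last, y_first, y_last) of the referenced pixels."""
    half_w = kernel_w // 2   # assumes an odd, centred kernel
    half_h = kernel_h // 2
    x_first = out_x0 - half_w
    x_last = out_x0 + parallelism - 1 + half_w
    y_first = out_y - half_h
    y_last = out_y + half_h
    return (x_first, x_last, y_first, y_last)
```

For outputs starting at x = 5 on line y = 6, a degree of parallelism of 8, and an 11 × 13 kernel, the referenced region spans x = 0 to 17 and y = 0 to 12, that is, 18 × 13 pixels.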

FIG. 12 is a time chart illustrating the convolution operation performed by the parallel arithmetic circuit 22 of this embodiment. FIG. 12 shows part of the convolution processing for calculating one feature plane. All signals shown in FIG. 12 operate synchronously based on a clock signal (not shown), and one product-sum operation is processed per clock. The Load1 signal is an enable signal for loading kernel weight coefficient data into the storage unit 502. While this signal is active (signal level 1), the control unit 501 reads the weight coefficient data for one row of the kernel from the RAM 500 and writes it to the storage unit 502. The size of one kernel row (the data size of the weight coefficients) is held in the register group 602 shown in FIG. 5. The control unit 501 determines the address of the data to be read from the weight coefficient address pointer information specified in the register group 602, the data size of the weight coefficients, the number of reference data planes, and so on. Assuming that the data width of the RAM 500 is 32 bits and that of a weight coefficient is 8 bits, writing the 11 weight coefficients of one horizontal row to the storage unit 502 completes in 3 clocks.

In the following, every read or write cycle to the RAM 500 is assumed to complete in one clock. When the loading of the weight coefficients is complete, the control unit 501 activates the Load3 signal to start loading the reference data. Like the Load1 signal, the Load3 signal is active at signal level 1.

As soon as the Load3 signal is activated, the control unit 501 reads reference data from the RAM 500 and sets it in the storage unit 503. The number of data items to set is determined from the kernel size and the degree of parallelism held in the register group 602. The control unit 501 determines the address of the data to be read from the RAM 500 from the reference data address pointer information specified in the register group 602, the size of the reference data, and the number of reference data planes. Since the reference data has 8 effective bits, writing, for example, 18 reference data items to the storage unit 503 completes in 5 cycles. In the example shown in FIG. 11, the horizontal kernel size is 11 and the degree of parallelism is 8, so 11 + 8 − 1 = 18 reference data items must be loaded.
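
The cycle counts quoted for the weight coefficients and the reference data follow from packing four 8-bit items into each 32-bit access. The sketch below illustrates this; the function name and the little-endian placement of items within a word are assumptions.

```python
def pack_bytes_to_words(values, bus_bits=32, data_bits=8):
    """Pack small data items into bus-width words, one word per RAM access."""
    per_word = bus_bits // data_bits
    mask = (1 << data_bits) - 1
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for lane, value in enumerate(values[start:start + per_word]):
            word |= (value & mask) << (data_bits * lane)  # assumed lane order
        words.append(word)
    return words
```

The number of words returned equals the number of one-clock accesses: 3 for 11 weight coefficients and 5 for 18 reference data items.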

The *CLR signal initializes the cumulative adder 508; when the signal is 0, the register 902 of the cumulative adder is initialized to 0. The control unit 501 sets the *CLR signal to 0 before starting the convolution operation for a new feature plane position.

The Load2 signal instructs the initialization of the shift register 504. When this signal is 1 and the Enable1 signal is active (signal level 1), the weight coefficient data held in the storage unit 502 is loaded into the shift register 504 in a single operation. The Enable1 signal controls the data transitions of the shift register. As shown in FIG. 12, the Enable1 signal is always set during operation, so when the Load2 signal is 1 the shift register latches the output of the storage unit 502 on the clock signal, and when the Load2 signal is 0 it continues shifting on the clock signal.

The sequence control unit 601 of the control unit 501 activates the Load2 signal once it has counted the number of clocks corresponding to the horizontal data size of the kernel.

At the same time as it stops the shift operation, it loads the weight coefficient data held in the storage unit 502 into the shift register 504 in a single operation. That is, the weight coefficients of one kernel row are loaded together, and the loaded weight coefficients are shifted out according to the operating clock.

The Load4 signal instructs the initialization of the shift register 505. When the Load4 signal is 1 and the Enable2 signal is active (signal level 1), the reference data held in the storage unit 503 is loaded into the shift register 505 in a single operation.

The Enable2 signal controls the data transitions of the shift register. As shown in FIG. 12, the Enable2 signal is set to 1 during operation, so when the Load4 signal is 1 the shift register latches the output of the storage unit 503 on the clock signal, and when the Load4 signal is 0 it continues shifting on the clock signal.

Once the sequence control unit 601 of the control unit 501 has counted the number of clocks corresponding to the horizontal data size of the kernel, it activates the Load4 signal, stops the shift operation, and at the same time loads the reference data held in the storage unit 503 in a single operation.

That is, for each kernel row in the convolution processing, the sequence control unit 601 loads the required reference data from the storage unit 503 into the shift register 505 in a single operation, and the shift register 505 shifts the loaded reference data according to the operating clock. The control unit 501 controls the Load4 signal at the same timing as the Load2 signal.
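
The resulting load timing can be approximated by a counter that asserts Load2 and Load4 once per kernel row. This is a rough sketch with a hypothetical function name; it assumes no stalls (Enable1 and Enable2 stay at 1) and that the clock on which the load occurs counts as the first of the kernel-width clocks for that row.

```python
def load_schedule(kernel_w, kernel_h):
    """Return (clock, kernel_row, load) tuples for one kernel operation."""
    schedule = []
    for clk in range(kernel_w * kernel_h):
        row = clk // kernel_w                    # kernel row being processed
        load = 1 if clk % kernel_w == 0 else 0   # Load2 = Load4 (same timing)
        schedule.append((clk, row, load))
    return schedule
```

For a 3 × 2 kernel, the load flag follows the pattern 1, 0, 0, 1, 0, 0, with one load per kernel row followed by shift clocks.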

In the "horizontal calculation cycle" shown in FIG. 12, the data to be loaded from the storage units 502 and 503 into the shift registers 504 and 505 may not be ready in time. In that case, the sequence control unit 601 deactivates the Enable1 and Enable2 signals to stop the arithmetic units and secure time for the data load. At the same time, the sequence control unit 601 also deactivates the latch signals of the cumulative adders 508a to 508n. Because the cumulative adders 508a to 508n continue the product-sum operation in synchronization with the clock, the product-sum operations corresponding to the kernel size are executed simultaneously for the plurality of feature plane positions being calculated, following the shift operations of the shift registers 504 and 505.

Specifically, during the shift operation period of the shift registers 504 and 505 (the horizontal calculation cycle period in FIG. 12), the product-sum operations for one kernel row are performed for a plurality of pixel positions on the feature plane. Furthermore, in the kernel operation section shown in FIG. 12, repeating this row-by-row kernel operation in the vertical direction while replacing the weight coefficients and reference data yields two-dimensional convolution results for as many positions as the degree of parallelism.

In this way, by controlling each signal according to the kernel size and the degree of parallelism, the control unit 501 processes the product-sum operations concurrently with the supply from the RAM 500 of the data they require (weight coefficient data and reference data). In the example shown in FIG. 12, the loading of reference data from the RAM 500 into the storage unit 503 and the loading of weight coefficients from the RAM 500 into the storage unit 502 both complete within the horizontal calculation cycle, so the time required for data loading does not affect the calculation speed. Depending on the kernel size of the convolution operation, however, the time required to load data from the RAM 500 into the storage units 502 and 503 may not fit within the horizontal calculation cycle.

FIG. 13 is a diagram showing the relationship between the data load cycle (the time required for data loading) and the horizontal calculation cycle when the kernel size and the degree of parallelism are varied. FIG. 13 is an example in which the data load cycle is calculated per horizontal calculation cycle shown in FIG. 12.

The degree of parallelism is the number of pairs of multipliers 507a to 507n and cumulative adders 508a to 508n that operate simultaneously. The "reference data load count" is the number of reference data items needed for the next operation that are loaded from the RAM 500 into the storage unit 503 within the "horizontal calculation cycle" shown in FIG. 12. The reference data load count is "degree of parallelism + horizontal kernel size - 1". The "weight coefficient load count" is the number of weight coefficient data items loaded from the RAM 500 into the storage unit 502 within the "horizontal calculation cycle" shown in FIG. 12.

In this embodiment, the "total load cycle", which corresponds to the total number of data load cycles needed for the next operation, can be calculated with the following formula.

Total load cycle = INT((reference data load count + 3) ÷ 4) + INT((weight coefficient load count + 3) ÷ 4) ... (2)
Here, INT(n) is a function that returns the largest integer not exceeding n. The reference data and the weight coefficient data are each 8 bits (1 byte).

Formula (2) applies when the data bus is 32 bits wide, as described above (that is, 4 bytes can be transferred in one access). It is also assumed that one memory access completes in one cycle.
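As a minimal sketch, formula (2) can be written as a small Python function. The function and parameter names are illustrative and do not come from the source. In particular, treating the weight coefficient load count as one kernel row (the horizontal kernel size) is an assumption made here so that the example is self-contained.

```python
def ceil_div(count: int, bus_bytes: int) -> int:
    """Accesses needed to move `count` 1-byte items; equals INT((count + 3) / 4) for a 4-byte bus."""
    return (count + bus_bytes - 1) // bus_bytes


def total_load_cycles(parallelism: int, kernel_width: int, bus_bytes: int = 4) -> int:
    """Total load cycle of formula (2), assuming one memory access completes per cycle."""
    # Reference data load count, as defined in the text.
    ref_loads = parallelism + kernel_width - 1
    # ASSUMPTION (not stated in the source): one kernel row of weights is loaded per cycle.
    weight_loads = kernel_width
    return ceil_div(ref_loads, bus_bytes) + ceil_div(weight_loads, bus_bytes)


if __name__ == "__main__":
    for p in (6, 7, 8):
        print(p, total_load_cycles(p, kernel_width=3))
```

Under these assumptions, a 3 × 3 kernel needs 3 load cycles at a degree of parallelism of 6, and 4 load cycles at 7 or 8.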

The "operation cycle for one kernel row" shown in FIG. 13 is the number of cycles for performing the convolution operation for one kernel row, and it equals the horizontal size of the kernel. The "required cycle" is the number of cycles needed to perform the convolution operation for one kernel row. As shown in FIG. 13, when the "total load cycle" is smaller than the "operation cycle for one kernel row", the "required cycle" equals the "operation cycle for one kernel row". In this case, the arithmetic processing time determines the processing time (that is, an arithmetic bottleneck). In FIG. 13, when the kernel size is 13 × 13, the "total load cycle" is smaller than the "operation cycle for one kernel row", so the "required cycle" has the same value as the "operation cycle for one kernel row". When the kernel size is large, the required cycle is always equal to the "kernel operation cycle".

On the other hand, when the "total load cycle" is larger than the "operation cycle for one kernel row", the "required cycle" equals the "total load cycle". In this case, the data transfer time determines the processing time (that is, a memory access bottleneck). When the kernel size is small (3 × 3 in FIG. 13), the required cycle varies with the degree of parallelism. When the degree of parallelism exceeds 7, the "total load cycle" exceeds the "operation cycle for one kernel row", so data transfer becomes the bottleneck and the "required cycle" takes the value of the "total load cycle", which is larger than the "operation cycle for one kernel row".

In such a case, the control unit 501 of the arithmetic circuit shown in FIG. 2 gives priority to completing the accesses to the RAM 500 (the data loads to the storage unit 502 and the storage unit 503, and the data save of the nonlinear conversion processing unit 509). That is, it controls the Enable1 signal, the Enable2 signal, the Enable3 signal, the Latch Enable signal of the cumulative adders, and so on. This stops the shift operations of the shift register 504 and the shift register 505 and the operation of the cumulative adders 508a to 508n, and delays the start of the product-sum operation until the data load completes.
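The rule for the required cycle and the resulting stall can be sketched as follows. This is an illustration only: the names are hypothetical, the stall length is derived here as the difference between the two cycle counts, and the handling of the equal case (which the text does not cover) is an assumption.

```python
from typing import NamedTuple


class RowTiming(NamedTuple):
    required_cycles: int   # cycles needed for one kernel row
    stall_cycles: int      # cycles the arithmetic units wait for data (derived, hypothetical)
    bottleneck: str        # "arithmetic" or "memory access"


def row_timing(total_load_cycles: int, row_compute_cycles: int) -> RowTiming:
    """Classify one kernel row as arithmetic-bound or memory-access-bound."""
    if total_load_cycles > row_compute_cycles:
        # Data transfer decides the time; the product-sum start is delayed until loading ends.
        stall = total_load_cycles - row_compute_cycles
        return RowTiming(total_load_cycles, stall, "memory access")
    # ASSUMPTION: the equal case is treated as arithmetic-bound; the value is the same either way.
    return RowTiming(row_compute_cycles, 0, "arithmetic")


if __name__ == "__main__":
    print(row_timing(total_load_cycles=4, row_compute_cycles=3))
    print(row_timing(total_load_cycles=9, row_compute_cycles=13))
```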

The "processing time per pixel" is a value representing the processing time per pixel when the convolution operations for one kernel row are executed in parallel. It is defined by the following formula.

Processing time per pixel = required cycle ÷ degree of parallelism ... (3)
When the kernel size is large (13 × 13 in FIG. 13), the "processing time per pixel" decreases as the degree of parallelism increases. That is, the higher the degree of parallelism, the shorter the overall processing time. On the other hand, when the kernel size is small (3 × 3 in FIG. 13), the "processing time per pixel" at a degree of parallelism of 7, for example, is larger than at a degree of parallelism of 6. That is, the processing time increases even though the degree of parallelism was raised. Even at a degree of parallelism of 8, the "processing time per pixel" is the same as at a degree of parallelism of 6. That is, the processing time does not change even though the degree of parallelism was raised. This is because data transfer becomes the bottleneck: the time to load data from the RAM 500 to the storage unit 502 and the storage unit 503 is longer than the arithmetic processing time of the multipliers 507a to 507n and the cumulative adders 508a to 508n.
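The behavior described above can be checked numerically with a short sketch of formula (3). The names are illustrative; the load model follows formula (2) with 1-byte data on a 4-byte bus, and the weight coefficient load count is assumed to be one kernel row, which the source does not state explicitly.

```python
from fractions import Fraction


def required_cycles(parallelism: int, kernel_width: int, bus_bytes: int = 4) -> int:
    """Larger of the total load cycle (formula (2)) and the one-row operation cycle."""
    ref_loads = parallelism + kernel_width - 1
    weight_loads = kernel_width  # ASSUMPTION: one kernel row of weights per cycle
    load = ((ref_loads + bus_bytes - 1) // bus_bytes
            + (weight_loads + bus_bytes - 1) // bus_bytes)
    return max(load, kernel_width)


def time_per_pixel(parallelism: int, kernel_width: int) -> Fraction:
    """Formula (3): required cycle divided by the degree of parallelism."""
    return Fraction(required_cycles(parallelism, kernel_width), parallelism)


if __name__ == "__main__":
    for kernel_width in (3, 13):
        for parallelism in (6, 7, 8):
            print(kernel_width, parallelism, time_per_pixel(parallelism, kernel_width))
```

With these assumptions, a 3 × 3 kernel gives 1/2, 4/7 and 1/2 cycles per pixel at degrees of parallelism 6, 7 and 8, which matches the text: 7 is slower than 6, and 8 is no faster than 6.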

When the degree of parallelism is 8, the number of multipliers 507a to 507n and cumulative adders 508a to 508n operating simultaneously is 8, which is more than at a degree of parallelism of 6, so the peak current consumption during operation is larger. Because the dynamic power consumption of the device is proportional to the sum of squares of the current values, the degree of parallelism should be as low as possible from the viewpoint of power consumption.

Therefore, in this embodiment, the control unit 501 takes as the predetermined value the kernel size (filter size) at which the arithmetic processing time and the data transfer time become about the same when the degree of parallelism of the arithmetic circuit 22 is set to its maximum value. When the kernel size (filter size) is equal to or smaller than the predetermined value, memory access becomes the bottleneck, so the control unit 501 sets the degree of parallelism of the arithmetic circuit 22 lower than the maximum value based on the filter size, so that the arithmetic processing time and the data transfer time are about the same. For example, when the kernel size (filter size) is 3 × 3, the control unit 501 sets the degree of parallelism of the arithmetic circuit 22 to 6. This reduces power consumption without changing the processing speed of the convolution operation. Instead of calculating the degree of parallelism of the arithmetic circuit 22 each time, a data table showing the correspondence between the kernel size (filter size) and the degree of parallelism of the arithmetic circuit 22 can be held in the RAM 500 in advance. The control unit 501 can then set the degree of parallelism of the arithmetic circuit 22 by referring to that data table.
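One way to build such a kernel-size-to-parallelism table offline is sketched below. The selection rule used here (the lowest degree of parallelism that still achieves the best processing time per pixel) is an assumption standing in for "about the same" in the text, and the load model assumes 1-byte data, a 4-byte bus and one kernel row of weights per cycle. All names are hypothetical.

```python
from fractions import Fraction


def time_per_pixel(parallelism: int, kernel_width: int, bus_bytes: int = 4) -> Fraction:
    """Required cycle per kernel row divided by the degree of parallelism."""
    ref_loads = parallelism + kernel_width - 1
    weight_loads = kernel_width  # ASSUMPTION: one kernel row of weights per cycle
    load = ((ref_loads + bus_bytes - 1) // bus_bytes
            + (weight_loads + bus_bytes - 1) // bus_bytes)
    return Fraction(max(load, kernel_width), parallelism)


def choose_parallelism(kernel_width: int, max_parallelism: int = 8) -> int:
    """Lowest degree of parallelism that reaches the minimum time per pixel."""
    candidates = range(1, max_parallelism + 1)
    # Sort key: fastest first, and on a tie prefer fewer active arithmetic units.
    return min(candidates, key=lambda p: (time_per_pixel(p, kernel_width), p))


# Hypothetical lookup table, computed once instead of on every run.
PARALLELISM_TABLE = {width: choose_parallelism(width) for width in (3, 13)}

if __name__ == "__main__":
    print(PARALLELISM_TABLE)
```

Under these assumptions the table maps a 3 × 3 kernel to a degree of parallelism of 6 and a 13 × 13 kernel to 8, in line with the example in the text.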

In this embodiment, the degree of parallelism is set according to the kernel size from this viewpoint. As described above, the degree of parallelism is set according to the contents of the register shown in FIG. 6. In FIG. 2, the control unit 501 controls the operation of the parallel arithmetic circuit according to the degree of parallelism written in that register. For example, even if the number of arithmetic units capable of parallel processing (multipliers 507a to 507n and cumulative adders 508a to 508n) is 8, when the kernel size is 3 × 3 the degree of parallelism is set to 6 and the arithmetic units are operated accordingly.

In this case, the control unit 501 loads from the RAM 500 into the storage unit 503 the number of reference data items corresponding to a degree of parallelism of 6, and starts the arithmetic processing. Prior to the arithmetic processing, the arithmetic unit control data generation unit 512 generates the arithmetic unit control signal 514 according to the "degree of parallelism" information written in the register group 602. For example, when the degree of parallelism is 6, the arithmetic unit control data generation unit 512 generates the arithmetic unit parallel signal "11111100". Each bit represents one control signal: 1 corresponds to an operating arithmetic unit and 0 to a stopped arithmetic unit.

The arithmetic unit control unit 517 asserts the Enable4 signal at the timing when the convolution operation starts, and latches the control signal in the mask register. The latched signals (signals 513a to 513n, which directly control the arithmetic units) control the operation of the multipliers 507a to 507n during the convolution operation. Specifically, only multiplier 0 to multiplier 5 and cumulative adder 0 to cumulative adder 5 perform arithmetic operations.
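The mapping from the degree of parallelism to the per-unit control bits can be illustrated in software as follows. This is only a sketch of the bit pattern, not of the hardware. The function names are hypothetical, and reading the leftmost bit as arithmetic unit 0 is an assumption that is consistent with units 0 to 5 operating for the pattern "11111100".

```python
def unit_enable_mask(parallelism: int, num_units: int = 8) -> str:
    """Return one control bit per arithmetic unit: '1' operates, '0' is stopped."""
    if not 0 <= parallelism <= num_units:
        raise ValueError("parallelism must be between 0 and num_units")
    return "1" * parallelism + "0" * (num_units - parallelism)


def active_units(mask: str) -> list[int]:
    """Indices of the arithmetic units enabled by a latched mask (leftmost bit = unit 0)."""
    return [index for index, bit in enumerate(mask) if bit == "1"]


if __name__ == "__main__":
    mask = unit_enable_mask(6)
    print(mask, active_units(mask))
```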

The Load5 signal is a signal for loading the results of the cumulative adders into the shift register 506 in parallel. When the product-sum operation for one parallel processing unit of the target feature plane finishes, the control unit 501 sets the Load5 signal and the Enable3 signal to 1. FIG. 12 shows an example in which there is one connection-destination feature plane (that is, the feature plane is calculated with only one set of convolution operations).

When the Load5 signal is 1 and the Enable3 signal is 1, the shift register 506 loads the outputs of the cumulative adders 508 all at once. At this timing, the completed convolution results are latched in the shift register 506. If the data loads to the storage unit 502 and the storage unit 503 have completed during the shift operations of the shift register 504 and the shift register 505, the control unit 501 asserts the Enable3 signal and shifts out the results held in the shift register 506. The shifted-out results are converted by the nonlinear conversion processing unit 509 and then stored by the control unit 501 at a predetermined address in the RAM 500, according to the operation result storage destination pointer and the reference data size written in the register group 602.

In the gaps between data loads from the RAM 500 to the storage unit 502 and the storage unit 503, the control unit 501 asserts the Enable3 signal and shifts out the operation results from the shift register 506. Here too, the control unit 501 controls the number of data items shifted out according to the "degree of parallelism" information written in the register group 602. When the degree of parallelism is 6, six operation results are shifted out.

As described above, the control unit 501 determines the degree of parallelism according to the "degree of parallelism" information written in the register group 602, and controls each unit.

The control unit 501 arbitrates the accesses to the RAM 500 by three processing units, namely the storage unit 502, the storage unit 503, and the nonlinear conversion processing unit 509, and pipelines the product-sum operation with the accesses of these three processing units to the RAM 500.

Because the nonlinear conversion processing unit 509 accesses the RAM 500 less frequently than the storage unit 502 and the storage unit 503, it operates at the lowest priority. That is, the accesses of the nonlinear conversion processing unit 509 are performed in time slots that fall in the gaps between the accesses of the storage unit 502 and the storage unit 503.
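The arbitration described above behaves like a fixed-priority arbiter, which can be sketched as follows. The requester names are hypothetical labels for the three units. The source only states that the nonlinear conversion processing unit 509 has the lowest priority, so the relative order of the storage units 502 and 503 used here is an assumption.

```python
from typing import Optional

# Highest priority first. ASSUMPTION: the order of 503 and 502 is not given in the source.
PRIORITY = ("storage_unit_503", "storage_unit_502", "nonlinear_unit_509")


def grant(requests: set[str]) -> Optional[str]:
    """Pick the single requester that may access the RAM in this time slot."""
    for requester in PRIORITY:
        if requester in requests:
            return requester
    return None  # no one is requesting; the slot stays idle


def schedule(slots: list[set[str]]) -> list[Optional[str]]:
    """Grant one access per time slot for a list of per-slot request sets."""
    return [grant(requests) for requests in slots]


if __name__ == "__main__":
    slots = [
        {"storage_unit_503", "nonlinear_unit_509"},
        {"storage_unit_502", "nonlinear_unit_509"},
        {"nonlinear_unit_509"},  # gap between loads: the save finally gets its slot
    ]
    print(schedule(slots))
```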

FIG. 14 is a flowchart illustrating the operation of the image processing apparatus including the arithmetic circuit 22 of this embodiment. In the following, the flowchart is assumed to be realized by the CPU 27 executing a control program. Although this embodiment is described using an image processing apparatus that performs pattern recognition as an example, the image processing apparatus of this embodiment is not limited to pattern recognition and can also be applied to processing such as object detection.

In step S101, the CPU 27 executes various initialization processes before the image processing apparatus starts pattern recognition. The CPU 27 transfers the weight coefficients required for the operation of the arithmetic circuit 22, which is the CNN processing unit of the image processing apparatus, from the ROM 28 to the RAM 500, and sets the various registers that define the operation of the arithmetic circuit 22, that is, the parameters of the CNN processing. Specifically, the CPU 27 sets predetermined values in the plurality of register groups 602 in the control unit 501 of the arithmetic circuit 22.

Further, in step S102, the degree of parallelism is set based on the kernel size and the data transfer time. As described above, the degree of parallelism is determined in advance from the kernel size, the memory access cycle, and so on. When step S102 finishes, the process proceeds to step S103, where each hardware module is started and a series of pattern recognition operations begins.

First, in step S104, the image input module 20 converts the signal output by the image sensor into digital data and stores it, frame by frame, in a frame buffer (not shown; built into the image input module 20). When storage in the frame buffer completes, the preprocessing module 25 starts image conversion based on a predetermined start signal. The preprocessing module 25 extracts luminance data from the image data in the frame buffer and performs contrast correction. The luminance data is generated by linearly converting the RGB image data into YIQ image data (YIQ being the color system defined by the National Television System Committee). For the contrast correction, a generally known contrast correction process is applied to enhance the contrast of the luminance data (Y data).
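As an illustration of the luminance extraction, the Y component of YIQ is a weighted sum of R, G and B. The source does not give the coefficients; the sketch below assumes the commonly used NTSC luma weights, and the function name is hypothetical.

```python
def rgb_to_luma(red: float, green: float, blue: float) -> float:
    """Y (luminance) of YIQ from RGB, using the standard NTSC weights (an assumption here)."""
    return 0.299 * red + 0.587 * green + 0.114 * blue


if __name__ == "__main__":
    # Pure green contributes far more to luminance than pure blue.
    print(rgb_to_luma(0, 255, 0), rgb_to_luma(0, 0, 255))
```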

The preprocessing module 25 stores the luminance data after contrast correction in the RAM 500 as the image for processing. Also in step S104, the face candidate detection module 31 operates. The face candidate detection module 31 identifies a skin color region from the YIQ color image data stored in the frame buffer and stores the result in the RAM 500 as processing target region information, which completes the acquisition of the image of the processing target region. When processing for one image is finished, the face candidate detection module 31 asserts a completion signal (an interrupt signal, not shown). On receiving the completion signal, the CPU 27 proceeds to step S105, starts the arithmetic circuit 22, and executes high-precision pattern recognition by CNN processing. The arithmetic circuit 22 uses the corrected luminance image data produced by the preprocessing module 25 and the processing target region information produced by the face candidate detection module 31 to execute convolution operations in parallel only on the image data of the processing target region.

In step S105, the arithmetic circuit 22 selects the kernel to be used in the convolution operation according to the set values of the register group 602. Further, in step S106, the arithmetic units that operate in parallel are specified according to the set values of the register group 602. Next, in step S107, the arithmetic circuit 22 processes the convolution operations in parallel on the arithmetic units specified in step S106, using the convolution kernel selected in step S105 (for example, 3041a to 3041b shown in FIG. 3). In step S108, the arithmetic circuit 22 determines whether processing has finished for all feature planes. When processing has finished for all feature planes, the process proceeds to step S109. In the example of FIG. 3, processing for all feature planes is finished when the calculation of the feature plane 307 is finished. In step S109, the arithmetic circuit 22 generates an interrupt signal that notifies the CPU 27 that processing for all feature planes has ended. In step S110, the processing from step S104 to step S109 is executed for all images.

On receiving the end notification interrupt from the control unit 501, the CPU 27 starts the DMAC 26 and transfers the final feature plane data in the RAM 500 to the RAM 29 on the CPU bus 30. The CPU 27 obtains information such as the position and size of the predetermined object to be detected from the final layer detection result placed in the RAM 500. Specifically, the final detection result is binarized, and the object position and size are extracted by processing such as labeling.

A typical parallel convolution circuit often assigns arithmetic units to the kernel. FIG. 20 shows an example in which the four coefficients of a 2 × 2 kernel are processed in parallel by four parallel multipliers 2100. In this case, the kernel and the parallel arithmetic units are interdependent, so it is not easy to change the number of arithmetic units operating simultaneously (the degree of parallelism) according to, for example, the reference data supply capacity. In the configuration of this embodiment, by contrast, parallel processing is performed per pixel position of the feature plane being generated, so there is no dependency between the kernel and the arithmetic units in the first place, and the degree of parallelism can be changed according to the operating conditions with simple control.

According to this embodiment, in a parallel arithmetic circuit that performs CNN processing in parallel at high speed, the degree of parallelism of the arithmetic circuit 22 is determined based on the kernel size and the reference data supply capacity. This makes it possible to suppress unnecessary circuit operation and reduce power consumption without lowering the arithmetic processing speed of the arithmetic circuit 22. In this embodiment, a small kernel size was described as an example of a case where data transfer becomes the bottleneck, but the invention is not limited to this and can also be applied to other cases where data transfer becomes the bottleneck.

(Second Embodiment)
The configuration of the image processing apparatus of this embodiment is the same as that of the first embodiment, so only the processing that differs from the first embodiment is described below. FIG. 15 is a diagram schematically illustrating a processing example of the image processing apparatus according to the second embodiment of the present invention. In the processing of FIG. 15, the image processing apparatus changes the degree of parallelism of the arithmetic circuit according to the data size of the face candidate region in the input image.

Here, processing that recognizes a face image region in the input image is described as an example. The input image data 1501 is image data input from the image input module 20. The input image data 1501 passes through the face candidate detection process 1502 performed by the face candidate detection module 31, which generates region limitation information 1503 (coordinate information) about the region on which the convolution operation 1506 performs high-precision recognition. The region limitation information 1503 is information about the processing target region of the convolution operation 1506 within the input image data 1501. The data in the region (white region) indicated by the coordinate data 1504a and the coordinate data 1504b is the data to be processed in the input image data 1501. The face candidate detection process 1502 obtains the face candidate region in the input image with simple, low-load processing. Various methods can be used to obtain the face candidate region, such as extracting a rectangular or elliptical region containing specific color information, or using the extraction result of the previous frame (when applied to moving images). The pixel data of the face candidate region is the data to be processed by the convolution operation 1506.

In this embodiment, the arithmetic circuit 22 controls the execution of each arithmetic unit 1507 based on the size of the face candidate region. Here, the convolution operation 1506 is executed by the arithmetic circuit shown in FIG. 2. Specifically, the feature plane pixels 1507 computed in parallel are calculated by the multipliers 507a to 507n and the cumulative adders 508a to 508n in FIG. 2, and each arithmetic unit 1507 is controlled by the arithmetic unit control unit 517 in FIG. 2. FIG. 15 shows an example in which the degree of parallelism of the parallel arithmetic circuit is 8. That is, the arithmetic units 1507 operating in parallel simultaneously execute the convolution operation for eight horizontally consecutive pixel positions of interest. In FIG. 15, the convolution operation is performed with reference to the input image data of the pixel region centered on the position of each arithmetic unit 1507.

The arithmetic units 1507 repeat the arithmetic processing in units of 8 pixels and scan the image data, thereby completing the convolution operation for one image. When CNN processing is executed, the convolution operation is repeated using a plurality of filter coefficients (kernel coefficients) for calculating a plurality of feature planes.

The arithmetic circuit of this embodiment is not limited to face recognition and can also be applied to recognition of other objects. For other objects, the detection of the face candidate region is replaced with the detection of a candidate region of the recognition target object, and the processing is otherwise the same.

In this embodiment, the arithmetic unit control signal generation unit 517 generates signals for controlling each of the arithmetic units operating in parallel from the face candidate region (the coordinate data indicating the face candidate region) obtained in the face candidate region detection process 1502. The arithmetic unit control signal generation unit 517 outputs the control signals in synchronization with the operation of the arithmetic units 1507 and at a timing that depends on the type of operation (such as the convolution kernel size).
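One possible way to derive per-unit control bits from the candidate region is sketched below. This is not the circuit described in the source, only an illustration under stated assumptions: the region is a rectangle given by two inclusive corner coordinates, each row is scanned in groups of 8 pixels aligned to the left edge of the image, and a unit is enabled only when its pixel of interest lies inside the region. All names are hypothetical.

```python
def group_enable_masks(row_width: int, y: int,
                       region: tuple[int, int, int, int],
                       num_units: int = 8) -> list[str]:
    """Return one enable mask per group of `num_units` pixels on image row `y`.

    `region` is (x_min, y_min, x_max, y_max) with inclusive bounds (an assumption).
    """
    x_min, y_min, x_max, y_max = region
    masks = []
    for group_start in range(0, row_width, num_units):
        bits = []
        for unit in range(num_units):
            x = group_start + unit
            inside = (x < row_width
                      and x_min <= x <= x_max
                      and y_min <= y <= y_max)
            bits.append("1" if inside else "0")
        masks.append("".join(bits))
    return masks


if __name__ == "__main__":
    # A 16-pixel-wide row crossing a region that spans x = 3..10.
    print(group_enable_masks(row_width=16, y=0, region=(3, 0, 10, 0)))
```

In this sketch, a group whose mask is all zeros lies completely outside the candidate region, so none of the arithmetic units needs to run for it.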

The processing of the image processing apparatus of this embodiment is described below with reference to the flowchart shown in FIG. 16. The configuration of the image processing apparatus and the configuration of the arithmetic circuit 22 are each the same as in the first embodiment.

In step S1601, the CPU 27 executes various initialization processes before the image processing apparatus starts processing. The CPU 27 transfers the weight coefficients required for the operation of the arithmetic circuit 22, which is the CNN processing unit of the image processing apparatus, from the ROM 28 to the RAM 500, and sets the various registers that define the operation of the arithmetic circuit 22, that is, the parameters of the CNN processing. Specifically, the CPU 27 sets predetermined values in the plurality of register groups 602 in the control unit 501 of the arithmetic circuit 22.

When the initialization is complete, the process proceeds to step S1602, where the CPU 27 starts each hardware module and begins a series of face recognition processes. First, in step S1603, the image input module 20 converts the signal output by the image sensor into digital data and stores it, frame by frame, in a frame buffer (not shown; built into the image input module 20). When storage in the frame buffer is complete, the preprocessing module 25 starts image conversion processing in response to a predetermined start signal. The preprocessing module 25 extracts luminance data from the image data in the frame buffer, performs contrast correction, and stores the contrast-corrected luminance data in the RAM 500 as the detection image.

Also in step S1603, the face candidate detection module 31 operates. The face candidate detection module 31 identifies skin color regions in the YIQ color image data stored in the frame buffer and acquires a rectangular region containing a skin color region as a face candidate region. It also acquires the processing target region, which contains the face candidate region, for the convolution operation. When region acquisition for one image is complete, the face candidate detection module 31 asserts a completion signal (an interrupt signal, not shown). On receiving the completion signal, the CPU 27 proceeds to step S1604, starts the arithmetic circuit 22, and executes high-precision face recognition processing. Using the corrected luminance image data obtained by the preprocessing module 25 and the face candidate region (processing target region) acquired by the face candidate detection module 31, the arithmetic circuit 22 executes the convolution operation in parallel only on the image data of the processing target region.

Although the present embodiment describes the case where the face candidate detection module 31 acquires a skin color region in the input image data as the face candidate region, the candidate region of the recognition target can be acquired by various other methods that use other features, such as shape information of the recognition target. The candidate region may also be acquired from region limitation information specified externally. For example, the CPU 27 can set region limitation information in advance according to predetermined conditions, and a predetermined small region can be cut out and acquired as the candidate region. The candidate region is not limited to a rectangle and can also be expressed by another shape, such as an ellipse.

In the present embodiment, first, in step S1604, the face candidate detection module 31 determines the size of the face candidate region acquired in step S1603. The size determined here is the size of the face candidate region in the same direction as the direction of the parallel operation (the number of pixels in the width direction or in the height direction of the face candidate region). Next, in step S1605, the degree of parallelism for the parallel operation is determined according to the acquired size.

FIG. 17 illustrates the method of determining the degree of parallelism in the present embodiment. The degree of parallelism comes out the same whether it is determined from the input image data, which serves as the reference data, or from the feature plane data, which is the operation result; for simplicity, FIG. 17 is explained using the feature plane data. FIG. 17(a) shows the result of a conventional parallel operation. Image data 1701 represents the image data plane (feature plane) of the operation result. The white region (mask) in FIG. 17(a) and FIG. 17(b) indicates the region 1702 of the feature plane that corresponds, within the image data plane 1701 of the operation result, to the face candidate region in the input image data. This white region is the part of the feature plane that the arithmetic circuit 22 calculates by convolution with reference to the data of the face candidate region, and its size corresponds to the size of the face candidate region in the input image data. This white region is hereinafter called the calculation target region.

FIG. 17(a) shows the case where the number of arithmetic units that can operate in parallel (multipliers 507a to 507n and cumulative adders 508a to 508n) is eight. When the operations for the calculation target region 1702 are processed in parallel, one row of pixels of the calculation target region 1702 is calculated in two passes by the eight arithmetic units, shown as block 1703a and block 1703b. In the pass shown as block 1703b, however, only the two arithmetic units numbered 0 and 1 perform necessary operations; the arithmetic units numbered 2 to 7 perform unnecessary operations.

In the present embodiment, as shown in FIG. 17(b), the degree of parallelism of the arithmetic circuit 22 is determined according to the size of the calculation target region 1702. In the case of FIG. 17(b), the degree of parallelism is set to 5. In both the pass by the five arithmetic units shown as block 1704a and the pass by the five arithmetic units shown as block 1704b, the arithmetic units (multipliers 507a to 507n and cumulative adders 508a to 508n) execute only necessary operations. As with a degree of parallelism of 8, one row of pixels of the calculation target region is calculated in two passes. In other words, the processing speed is unchanged even though the degree of parallelism is lower.
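The comparison between FIG. 17(a) and FIG. 17(b) can be checked with a small count of multiplier slots per row. The sketch below is illustrative only: the function name is hypothetical, and the row width of 10 pixels is inferred from the description (eight units in the first pass plus units 0 and 1 in the second), not stated explicitly in the text.

```python
def count_multiplier_slots(width, parallelism):
    """Return (passes, useful_slots, idle_slots) for one row of the region.

    width:       number of output pixels in one row of the calculation target region
    parallelism: number of arithmetic units operating in parallel
    """
    passes = -(-width // parallelism)      # ceiling division
    total_slots = passes * parallelism     # slots consumed across all passes
    return passes, width, total_slots - width


# Inferred row width of 10, compared at parallelism 8 and 5.
for parallelism in (8, 5):
    passes, useful, idle = count_multiplier_slots(10, parallelism)
    print(f"parallelism={parallelism}: passes={passes}, useful={useful}, idle={idle}")
```

Under these assumptions both settings need two passes, but only the 8-wide setting leaves slots in which units would perform unnecessary operations.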

When the size of the calculation target region is not an integer multiple of the number of arithmetic units that can operate in parallel (the maximum degree of parallelism), the optimal degree of parallelism can be calculated, for example, by the following equation.

$$\text{optimal degree of parallelism} = n - \mathrm{INT}\!\left(\frac{n - m}{\mathrm{INT}(w/n) + 1}\right) \qquad (4)$$

Here, $w$ is the size of the calculation target region, $n$ is the number of arithmetic units that can operate in parallel (the maximum degree of parallelism), $m$ is the remainder of $w/n$, and $\mathrm{INT}(x)$ is the function that returns the largest integer not exceeding $x$.
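Equation (4) can be written as a short function to check it against the FIG. 17 example. This is a minimal sketch, not the circuit's implementation: the function name is hypothetical, and returning the maximum degree of parallelism when the size is an exact multiple of n is an assumption, since the text states equation (4) only for the non-multiple case.

```python
def optimal_parallelism(w, n):
    """Lowest degree of parallelism that keeps the number of passes unchanged.

    w: size of the calculation target region in the parallel direction
    n: maximum degree of parallelism (number of arithmetic units)
    """
    if w <= 0 or n <= 0:
        raise ValueError("w and n must be positive")
    m = w % n                      # remainder of w / n
    if m == 0:
        # Assumption: equation (4) is not applied to exact multiples of n.
        return n
    return n - (n - m) // (w // n + 1)


# Inferred FIG. 17 case: region width 10 with 8 arithmetic units.
print(optimal_parallelism(10, 8))
```

For a width of 10 and eight arithmetic units the function gives 5, which agrees with the degree of parallelism chosen in FIG. 17(b).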

Following equation (4) makes it possible to perform the arithmetic processing at the lowest degree of parallelism that does not reduce the processing speed. That is, power consumption can be reduced without lowering the processing speed.

Instead of calculating the degree of parallelism with equation (4) each time, the arithmetic unit control data generation unit 512 included in the arithmetic unit control unit 517 can determine the degree of parallelism by referring to a data table that holds precalculated values. For example, suppose the RAM 500 holds a data table of the correspondence between the width or height of the calculation target region and the degree of parallelism of the arithmetic circuit 22. The arithmetic unit control data generation unit 512 can then control the degree of parallelism of the arithmetic circuit 22 by accessing that table. FIG. 18 shows an example of such a data table: when the width of the calculation target region is between 9 and 12, the degree of parallelism is set lower. Because the data size of the calculation target region in the feature plane corresponds to the data size of the face candidate region in the input image data, a table of the correspondence between the data size of the face candidate region and the degree of parallelism can also be used. In that case, the degree of parallelism of the arithmetic circuit 22 can be controlled directly from the data size of the face candidate region acquired by the face candidate detection module 31, which simplifies the processing.
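The table-based approach can be sketched as a dictionary precomputed once and then consulted per region. The actual contents of FIG. 18 are not reproduced in the text, so the table below is a hypothetical one generated from equation (4) for eight arithmetic units and widths 1 to 16; all names are illustrative, and exact multiples of eight are assumed to keep the maximum degree of parallelism.

```python
MAX_PARALLELISM = 8  # assumed number of arithmetic units


def build_parallelism_table(max_width, n=MAX_PARALLELISM):
    """Precompute width -> degree of parallelism using equation (4)."""
    table = {}
    for w in range(1, max_width + 1):
        m = w % n
        # Assumption: exact multiples of n keep the maximum parallelism.
        table[w] = n if m == 0 else n - (n - m) // (w // n + 1)
    return table


# Built once (for example at initialization), then only read.
PARALLELISM_TABLE = build_parallelism_table(16)


def lookup_parallelism(width):
    """Return the precomputed degree of parallelism for a region width."""
    return PARALLELISM_TABLE[width]


print({w: lookup_parallelism(w) for w in range(9, 13)})
```

In this hypothetical table, widths 9 to 12 all map to a degree of parallelism below 8, matching the behavior described for FIG. 18. A lookup of this kind replaces the divisions in equation (4) with a single memory access.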

When the arithmetic unit control unit 517 uses a table in this way, the control processing of the arithmetic circuit 22 according to the present embodiment can be realized with a smaller circuit scale.

As in the first embodiment, the degree of parallelism is controlled by the arithmetic unit control unit 517 controlling the operation of the multipliers 507a to 507n. The control unit 501 controls the overall operation of the arithmetic processing according to the degree of parallelism determined by the arithmetic unit control unit 517.

In step S1606, the arithmetic circuit 22 executes the convolution operation at the determined degree of parallelism. When the arithmetic processing for all feature planes is complete in step S1607, the process proceeds to step S1608, where the arithmetic circuit 22 generates an interrupt signal to the CPU 27. The arithmetic processing for all feature planes is complete when the calculation of the feature plane 307 shown in FIG. 3 has finished.

In step S1609, the processing from step S1603 to step S1608 is executed for all images.

As described above, according to the present embodiment, determining the degree of parallelism according to the size of the calculation target region (its size in the same direction as the direction of parallel processing) makes it possible to reduce power consumption while maintaining processing speed.

In the present embodiment, the face candidate detection module 31 may identify a plurality of skin color regions in a single piece of input image data received from the image input module 20, and thus acquire a plurality of face candidate regions and processing target regions from that single image. In this case, CNN processing can be performed on all the processing target regions by applying the processing shown in FIG. 16 to each of them in turn. The degree of parallelism of the arithmetic circuit is controlled as described above.

Alternatively, CNN processing can be performed on all the processing target regions by applying the processing shown in FIG. 16 to the plurality of processing target regions collectively. In this case, the arithmetic unit control unit 517 controls the degree of parallelism according to the number of processing target regions.

For example, when the number of processing target regions is small, the processing time of the arithmetic circuit 22 may be shorter than the specified time. In this case, power consumption can be reduced by lowering the degree of parallelism of the arithmetic circuit 22. That is, the lowest degree of parallelism that can complete the processing within the processing time specified for the image processing apparatus is set.

The degree of parallelism can be determined simply, for example, by giving the arithmetic unit control data generation unit 512 table data. That is, table data holding the correspondence between the number of processing target regions and the degree of parallelism can be held in the RAM 500. Alternatively, a plurality of registers that set predetermined decision thresholds may be provided, and the degree of parallelism determined by comparison with the register values. The control unit 501 controls the overall operation of the arithmetic processing according to the degree of parallelism determined by the arithmetic unit control unit 517.
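The threshold-register alternative can be sketched as a comparison of the region count against a sorted list of thresholds. The threshold values, the parallelism levels, and all names below are hypothetical placeholders; the text does not give concrete register values, and in practice they would be chosen so that processing finishes within the specified time.

```python
import bisect

# Hypothetical register contents: upper bounds on the number of regions.
REGION_COUNT_THRESHOLDS = [2, 4, 8]
# Hypothetical degree of parallelism for each band (one more entry than thresholds).
PARALLELISM_LEVELS = [2, 4, 6, 8]


def parallelism_for_region_count(num_regions):
    """Pick a degree of parallelism by comparing the count with the thresholds."""
    if num_regions < 0:
        raise ValueError("num_regions must be non-negative")
    # Index of the first threshold that is >= num_regions.
    band = bisect.bisect_left(REGION_COUNT_THRESHOLDS, num_regions)
    return PARALLELISM_LEVELS[band]


for count in (1, 3, 5, 9):
    print(count, parallelism_for_region_count(count))
```

Fewer regions select a lower degree of parallelism, and counts above the last threshold fall back to the maximum.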

As described above, even when a single image contains a plurality of processing target regions, power consumption can be reduced by determining the degree of parallelism according to the number of processing target regions.

(Third Embodiment)
The configuration of the image processing apparatus of the present embodiment is the same as that of the first embodiment, so only the processing that differs from the first embodiment is described below. In the hierarchical CNN processing of the present embodiment, the filter operation is performed hierarchically multiple times, and a processing unit (not shown) performs subsampling for each layer and stores the subsampled feature planes in the RAM 500. In hierarchical CNN processing, subsampling may mean thinning out the convolution results of the previous layer to form the reference data of the next layer, or pooling the convolution results to form the reference data of the next layer. Pooling reduces the feature plane generated in the previous layer using an average filter or a maximum filter. The processing unit of the present embodiment performs subsampling, pooling, or both.

In general, subsampling reduces the feature plane by a factor such as 1/2 or 1/4 in the horizontal and vertical directions. Because the processing unit performs subsampling or similar processing in CNN processing, the feature planes of later layers are often smaller than the input image. FIG. 19 shows an example of feature planes when the processing unit performs subsampling. Feature planes 2001, 2002, and 2003 are the feature planes to be processed in the first, second, and third layers, respectively; feature plane 2002 is subsampled by 1/2 both horizontally and vertically, and feature plane 2003 is subsampled by a further 1/2. In such a case, as shown by the parallel processing region 2004, arithmetic units operating 8-wide in the horizontal direction perform wasted operations when processing feature plane 2002 and feature plane 2003. The same applies when the processing unit performs pooling.

In the present embodiment, the arithmetic unit control data generation unit 512 controls the number of arithmetic units operating in parallel (the degree of parallelism) based on the subsampling applied at each layer. Specifically, when the arithmetic circuit 22 processes feature plane 2002, arithmetic units 6 and 7 are stopped and processing runs at a degree of parallelism of 6. When the arithmetic circuit 22 processes feature plane 2003, arithmetic units 3 to 7 are stopped and processing runs at a degree of parallelism of 3.
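The per-layer control can be pictured as an enable mask over the arithmetic units. The sketch below is a simplified illustration: the plane widths of 6 and 3 are assumptions chosen to reproduce the degrees of parallelism given above (FIG. 19 is not reproduced in the text), the names are hypothetical, and planes wider than the number of units simply enable all units rather than applying equation (4).

```python
MAX_UNITS = 8  # assumed number of arithmetic units operating horizontally


def active_units(plane_width, max_units=MAX_UNITS):
    """Return one enable flag per arithmetic unit for a feature plane width."""
    active = min(plane_width, max_units)   # never enable more units than exist
    return [unit < active for unit in range(max_units)]


# Assumed widths for feature planes 2002 and 2003.
for name, width in (("2002", 6), ("2003", 3)):
    mask = active_units(width)
    stopped = [unit for unit, enabled in enumerate(mask) if not enabled]
    print(f"feature plane {name}: parallelism={sum(mask)}, stopped units={stopped}")
```

With these assumed widths, units 6 and 7 are stopped for feature plane 2002 and units 3 to 7 for feature plane 2003.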

As described above, the parallel arithmetic circuit of the present embodiment can control the degree of parallelism in units of individual arithmetic units, so the arithmetic units can be used efficiently according to the size of the data to be processed (reference data) and power consumption can be reduced. In other words, controlling the number of arithmetic units operating in parallel (the degree of parallelism) based on the size of the feature plane that forms the processing target region (reference data) eliminates wasted operations and reduces power consumption.

(Other Embodiments)
The present invention can also be realized by supplying a program that implements one or more functions of the above embodiments to a system or apparatus via a network or a storage medium, and having one or more processors in a computer of that system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

500 RAM
501 Control unit
502 Storage unit
503 Storage unit
504 Shift register
505 Shift register
506 Shift register
507a to 507n Multipliers
508a to 508n Cumulative adders
509 Non-linear conversion unit

Claims

An arithmetic circuit connected to a holding device that holds reference data for filter arithmetic processing and coefficient data of a filter used for the filter arithmetic processing.
A plurality of multipliers that execute the filter calculation process by repeatedly multiplying the reference data that is different from each other and the coefficient data that is common to each other.
A first data supply means for supplying the plurality of multipliers with different reference data held in the holding device.
A second data supply means for supplying the common coefficient data held in the holding device to the plurality of multipliers, and
A control means that controls, when the filter size is equal to or less than a predetermined value, some of the plurality of multipliers to execute the multiplication and the other multipliers not to execute the multiplication, the time for which the multipliers repeatedly execute the multiplication increasing as the filter size of the filter calculation process increases.
An arithmetic circuit characterized by having.

A first storage means that temporarily stores the reference data held in the holding device, to be supplied to the multipliers performing the multiplication, and loads the stored reference data into the first data supply means, and
A second storage means that temporarily stores the coefficient data held in the holding device for supplying to the filter size and loads the stored coefficient data into the second data supply means.
The arithmetic circuit according to claim 1, further comprising.

The holding device holds a table showing the correspondence between the filter size and the number of multipliers that perform the multiplication.
The arithmetic circuit according to claim 2, wherein the control means refers to the table and determines the number of the multipliers that perform the multiplication based on the filter size.

The arithmetic circuit according to any one of claims 1 to 3, wherein the number of reference data loaded from the holding device into the first data supply means is one less than the sum of the filter size and the number of multipliers performing the multiplication.

The arithmetic circuit according to any one of claims 1 to 4, wherein each of the first data supply means and the second data supply means includes a shift register.

The arithmetic circuit according to any one of claims 1 to 5, wherein the control means controls each multiplier to execute or not to execute the multiplication by outputting an arithmetic control signal to each of the plurality of multipliers.

The arithmetic circuit according to claim 6, wherein the control means has a mask register that is connected to each of the plurality of multipliers and outputs the arithmetic control signal, and, when controlling a multiplier not to execute the multiplication, the mask register masks the supply of the reference data or the coefficient data to that multiplier so that the signals of the multiplier do not transition.

The arithmetic circuit according to any one of claims 1 to 7, further comprising a plurality of cumulative adders that cumulatively add the multiplication results of the plurality of multipliers.

A conversion means that performs non-linear conversion processing on the output data of each of the plurality of cumulative adders, and
A selection means for selecting and outputting either the output data of each of the plurality of cumulative adders or the output data of the conversion means, and
The arithmetic circuit according to claim 8, further comprising.

An arithmetic circuit connected to a holding device that holds reference data for filter arithmetic processing and coefficient data of a filter used for the filter arithmetic processing.
A plurality of multipliers that execute the filter calculation process by repeatedly multiplying the reference data that is different from each other and the coefficient data that is common to each other.
A first data supply means for supplying the plurality of multipliers with different reference data held in the holding device.
A second data supply means for supplying the common coefficient data held in the holding device to the plurality of multipliers, and
An input means for inputting image data from the outside as the reference data, and
A detection means that detects a processing target area from the input image data and acquires pixel data of the processing target area as the reference data.
A control means that, based on the data size of the processing target area, controls some of the plurality of multipliers to execute the multiplication and the other multipliers not to execute the multiplication.
An arithmetic circuit characterized by having.

The holding device further holds a table showing the correspondence between the number of the multipliers performing the multiplication and the data size of the processing target area.
The arithmetic circuit according to claim 10, wherein the control means refers to the table held in the holding device to determine the number of multipliers that perform the multiplication.

It is an arithmetic circuit connected to a holding device that holds reference data of a plurality of times of filter arithmetic processing and coefficient data of a filter used for the filter arithmetic processing.
A plurality of multipliers that execute the filter calculation process by repeatedly multiplying the reference data that is different from each other and the coefficient data that is common to each other.
A first data supply means for supplying the plurality of multipliers with different reference data held in the holding device.
A second data supply means for supplying the common coefficient data held in the holding device to the plurality of multipliers, and
A processing means that acquires, from the plurality of data output as a result of the first filter calculation process, data obtained by pooling or subsampling as the reference data of the second filter calculation process, and causes the holding device to hold that data, and
A control means that, based on the data size of the reference data of the second filter calculation process, controls some of the plurality of multipliers to execute the multiplication and the other multipliers not to execute the multiplication.
An arithmetic circuit characterized by having.

The holding device further holds a table showing the correspondence between the number of the multipliers performing the multiplication and the data size of the reference data of the second filter calculation process.
The arithmetic circuit according to claim 12, wherein the control means refers to the table held in the holding device to determine the number of multipliers that perform the multiplication.

An image processing apparatus having the arithmetic circuit according to any one of claims 1 to 11 and processing image data as the reference data.

It is a control method of an arithmetic circuit connected to a holding device that holds the reference data of the filter arithmetic processing and the coefficient data of the filter used for the filter arithmetic processing.
A multiplication step in which the filter calculation process is executed by a plurality of multipliers by repeatedly multiplying the reference data that is different from each other and the coefficient data that is common to each other.
A first supply step of supplying the plurality of multipliers with different reference data held in the holding device by the first data supply means.
A second supply step of supplying the common coefficient data held by the holding device to the plurality of multipliers by the second data supply means, and
A control step in which a control means controls, when the filter size is equal to or less than a predetermined value, some of the plurality of multipliers to execute the multiplication and the other multipliers not to execute the multiplication, the time for which the multipliers repeatedly execute the multiplication increasing as the filter size of the filter calculation process increases.
A method characterized by having.

It is a control method of an arithmetic circuit connected to a holding device that holds the reference data of the filter arithmetic processing and the coefficient data of the filter used for the filter arithmetic processing.
A multiplication step in which the filter calculation process is executed by a plurality of multipliers by repeatedly multiplying the reference data that is different from each other and the coefficient data that is common to each other.
A first supply step of supplying the plurality of multipliers with different reference data held in the holding device by the first data supply means.
A second supply step of supplying the common coefficient data held by the holding device to the plurality of multipliers by the second data supply means, and
An input process in which image data is input from the outside as the reference data by an input means,
An acquisition step of detecting a processing target area from the input image data and acquiring pixel data of the processing target area as the reference data by a detection means.
A control step in which a control means, based on the data size of the processing target area, controls some of the plurality of multipliers to execute the multiplication and the other multipliers not to execute the multiplication.
A method characterized by having.

It is a control method of an arithmetic circuit connected to a holding device that holds reference data of a plurality of filter arithmetic processes and coefficient data of a filter used for the multiple filter arithmetic processes.
A multiplication step in which the filter calculation process is executed by a plurality of multipliers by repeatedly multiplying the reference data that is different from each other and the coefficient data that is common to each other.
A first supply step of supplying the plurality of multipliers with different reference data held in the holding device by the first data supply means.
A second supply step of supplying the common coefficient data held by the holding device to the plurality of multipliers by the second data supply means, and
An acquisition step in which a processing means acquires, as the reference data of a second filter calculation process, data obtained by applying a pooling process or a subsampling process to the data output as a result of a first filter calculation process, and holds the acquired data in the holding device, and
A control step in which a control means performs control, based on the data size of the reference data of the second filter calculation process, so that some of the plurality of multipliers execute the multiplication and the other multipliers do not execute the multiplication,
A method characterized by having.
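
The data-size-based control recited in the method above can be illustrated with a minimal software sketch. It is not the claimed circuit: the function names (`active_multiplier_mask`, `filter_row`, `max_pool`), the one-dimensional data layout, and the rule of enabling one multiplier per output position are all assumptions made for illustration. Each enabled multiplier receives a different reference sample while all of them share the same coefficient in a given cycle, and multipliers beyond the width of the data are left idle.

```python
def active_multiplier_mask(num_multipliers, data_width):
    """Enable one multiplier per remaining output position (assumed rule)."""
    active = min(num_multipliers, max(data_width, 0))
    return [i < active for i in range(num_multipliers)]


def filter_row(reference, coefficients, num_multipliers):
    """1-D filter on a simulated multiplier array.

    In each cycle every enabled multiplier gets the same coefficient but a
    different reference sample. Returns the outputs and, for each block of
    outputs, how many multipliers were enabled.
    """
    out_width = len(reference) - len(coefficients) + 1
    outputs, active_counts = [], []
    for base in range(0, max(out_width, 0), num_multipliers):
        enable = active_multiplier_mask(num_multipliers, out_width - base)
        acc = [0] * num_multipliers
        for k, coeff in enumerate(coefficients):  # one cycle per coefficient
            for m in range(num_multipliers):
                if enable[m]:                     # disabled multipliers do nothing
                    acc[m] += reference[base + m + k] * coeff
        outputs.extend(acc[m] for m in range(num_multipliers) if enable[m])
        active_counts.append(sum(enable))
    return outputs, active_counts


def max_pool(row, window=2):
    """Non-overlapping max pooling, shrinking the data for the next filter."""
    return [max(row[i:i + window]) for i in range(0, len(row) - window + 1, window)]


# First filter on 10 samples, pooling, then a second filter on the smaller data.
first_out, first_active = filter_row(list(range(10)), [1, 1, 1], num_multipliers=8)
pooled = max_pool(first_out)
second_out, second_active = filter_row(pooled, [1, 1, 1], num_multipliers=8)
```

In this sketch the first filter keeps all eight simulated multipliers busy, whereas the second filter, whose reference data was reduced by pooling, enables only two of them.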

A control program of an arithmetic circuit connected to a holding device that holds reference data for filter arithmetic processing and coefficient data of a filter used for the filter arithmetic processing.
A multiplication step in which a plurality of multipliers execute the filter calculation process by repeatedly multiplying the reference data that is different from each other and the coefficient data that is common to each other.
A first supply step of causing the first data supply means to supply the plurality of multipliers with the different reference data held in the holding device.
A second supply step of causing the second data supply means to supply the common coefficient data held by the holding device to the plurality of multipliers.
A control step in which the time for which the multipliers repeatedly execute the multiplication increases as the filter size of the filter calculation process increases, and in which, when the filter size is equal to or less than a predetermined value, a control means is caused to perform control so that some of the plurality of multipliers execute the multiplication and the other multipliers do not execute the multiplication.
A program characterized by having a computer execute.
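
The filter-size condition in the control step above can be sketched as follows. This is only an illustrative model under stated assumptions: a square filter is assumed, and `size_threshold` and `reduced_count` are hypothetical parameters, since the claim states only that a predetermined value exists and that some of the multipliers are used, not what the values are.

```python
def repeat_cycles(filter_size):
    """Multiplications each multiplier repeats per output value.

    Assumes a square filter with one multiplication per coefficient, so the
    repeat time grows with the filter size.
    """
    return filter_size * filter_size


def multipliers_to_enable(filter_size, num_multipliers, size_threshold, reduced_count):
    """Number of multipliers allowed to execute the multiplication.

    size_threshold and reduced_count are hypothetical illustration values.
    """
    if filter_size <= size_threshold:
        # Small filter: only part of the array executes the multiplication.
        return min(reduced_count, num_multipliers)
    # Larger filter: every multiplier is used.
    return num_multipliers


small = multipliers_to_enable(3, num_multipliers=16, size_threshold=5, reduced_count=8)
large = multipliers_to_enable(7, num_multipliers=16, size_threshold=5, reduced_count=8)
```

With these example values, a 3x3 filter enables half of the sixteen simulated multipliers, while a 7x7 filter, which repeats more multiplications per output, enables all of them.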

A control program of an arithmetic circuit connected to a holding device that holds reference data for filter arithmetic processing and coefficient data of a filter used for the filter arithmetic processing.
A multiplication step in which a plurality of multipliers execute the filter calculation process by repeatedly multiplying the reference data that is different from each other and the coefficient data that is common to each other.
A first supply step of causing the first data supply means to supply the plurality of multipliers with different reference data held in the holding device.
A second supply step of causing the second data supply means to supply the common coefficient data held by the holding device to the plurality of multipliers.
An input step in which image data is input to the input means from the outside as the reference data, and
An acquisition step of causing a detection means to detect a processing target area from the input image data and to acquire pixel data of the processing target area as the reference data.
A control step of causing a control means to perform control, based on the data size of the processing target area, so that some of the plurality of multipliers execute the multiplication and the other multipliers do not execute the multiplication.
A program characterized by having a computer execute.

A control program for an arithmetic circuit connected to a holding device that holds reference data for a plurality of filter arithmetic processes and coefficient data of filters used for the plurality of filter arithmetic processes.
A multiplication step in which a plurality of multipliers execute the filter calculation process by repeatedly multiplying the reference data that is different from each other and the coefficient data that is common to each other.
A first supply step of causing the first data supply means to supply the plurality of multipliers with different reference data held in the holding device.
A second supply step of causing the second data supply means to supply the common coefficient data held by the holding device to the plurality of multipliers.
An acquisition step of causing a processing means to acquire, as the reference data of a second filter calculation process, data obtained by applying a pooling process or a subsampling process to the data output as a result of a first filter calculation process, and to hold the acquired data in the holding device.
A control step of causing a control means to perform control, based on the data size of the reference data of the second filter calculation process, so that some of the plurality of multipliers execute the multiplication and the other multipliers do not execute the multiplication.
A program characterized by having a computer execute.

Multiple multipliers that perform filter arithmetic processing,
A control means that performs control, based on the filter size in the filter calculation process, so that some of the plurality of multipliers execute the filter calculation process and the other multipliers do not execute the filter calculation process,
An arithmetic circuit characterized by having.

Multiple multipliers that perform filter arithmetic processing,
A control means that performs control, based on the data size of the reference data of the filter calculation process, so that some of the plurality of multipliers execute the filter calculation process and the other multipliers do not execute the filter calculation process,
An arithmetic circuit characterized by having.

The arithmetic circuit according to claim 22, wherein the reference data is a processing target area of image data, and
the control means performs control, based on the data size of the processing target area, so that some of the plurality of multipliers execute the filter calculation process and the other multipliers do not execute the filter calculation process.

The arithmetic circuit according to any one of claims 1 to 14 and 21 to 23, wherein the filter arithmetic processing is a convolution arithmetic processing for feature extraction.

The arithmetic circuit according to any one of claims 1 to 14 and 21 to 24, wherein the arithmetic circuit performs an arithmetic for Convolutional Neural Networks.

The arithmetic circuit according to any one of claims 1 to 14 and 21 to 25, wherein the coefficient data of the filter used in the filter arithmetic processing is determined by machine learning.

The arithmetic circuit according to any one of claims 1 to 14 and 21 to 26, wherein the control means controls the number of multipliers that perform the processing, among the plurality of multipliers, based on the transfer time of the data used for the processing of the plurality of multipliers.

A control method for an arithmetic circuit having a plurality of multipliers for executing filter arithmetic processing.
A method of controlling an arithmetic circuit, characterized in that, based on the filter size in the filter calculation process, control is performed so that some of the plurality of multipliers execute the filter calculation process and the other multipliers do not execute the filter calculation process.

A control method for an arithmetic circuit having a plurality of multipliers for executing filter arithmetic processing.
A method of controlling an arithmetic circuit, characterized in that, based on the data size of the reference data of the filter calculation process, control is performed so that some of the plurality of multipliers execute the filter calculation process and the other multipliers do not execute the filter calculation process.