JP6964969B2

JP6964969B2 - Arithmetic processing unit, arithmetic processing method and program

Info

Publication number: JP6964969B2
Application number: JP2016193553A
Authority: JP
Inventors: 悠介谷内出; 政美加藤; 貴久山本; 修野村; 嘉則伊藤; 克彦森
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2016-09-30
Filing date: 2016-09-30
Publication date: 2021-11-10
Anticipated expiration: 2036-09-30
Also published as: JP2018055570A

Description

The present invention relates to an arithmetic processing unit, an arithmetic processing method, and a program used for pattern recognition and the like.

Multi-layer neural networks called deep nets (also referred to as deep neural networks or deep learning) have attracted a great deal of attention in recent years. The term deep net does not denote a specific computation method; it generally refers to hierarchical arithmetic processing of input data (for example, image data) in which the processing result of one layer becomes the input to the processing of the next layer. In the field of image recognition in particular, deep nets composed of convolution layers, which perform convolution filter operations, and connected layers, which perform connection operations, are becoming mainstream. Convolutional Neural Networks (hereinafter abbreviated as CNN) are a representative way of realizing such a deep net, and a CNN-based method is described below.

FIG. 3 shows an example of a convolution filter operation, in which a filter operation with a filter kernel 302 of kernel size 3 × 3 is applied to an image 301 to be processed. In this case, the convolution filter operation result is calculated by the product-sum operation shown in the following equation.
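The equation referred to here is not present in this text. A reconstruction that is consistent with the symbol definitions given for it (1-based s and t, with w_{s,t} applied to the pixel at coordinates (i+s-1, j+t-1)) would be the following. Which of columnSize and rowSize bounds which index is an assumption, and the number (1) is taken from the later reference to "equation (1)".

```latex
f_{i,j} = \sum_{s=1}^{\mathrm{columnSize}} \sum_{t=1}^{\mathrm{rowSize}}
          d_{\,i+s-1,\;j+t-1} \times w_{s,t} \tag{1}
```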

Here, "d_{i,j}" denotes the pixel value of the image to be processed at coordinates (i, j), and "f_{i,j}" denotes the filter operation result at coordinates (i, j). "w_{s,t}" denotes the filter kernel value (filter coefficient parameter) applied to the pixel value of the image to be processed at coordinates (i+s-1, j+t-1), and "columnSize" and "rowSize" denote the filter kernel size. The output of the convolution filter operation is obtained by performing the above operation while scanning the filter kernel 302 over the image to be processed.
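The product-sum described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `convolve2d_valid` is hypothetical, indices are 0-based rather than 1-based, the first index is treated as the row, and the kernel is applied only where it fits entirely inside the image (the text does not say how borders are handled).

```python
def convolve2d_valid(image, kernel):
    """Product-sum filter at every position where the kernel fits entirely.

    image  : list of rows of pixel values (d in the text)
    kernel : list of rows of filter coefficients (w in the text)
    Returns the filter results (f in the text) as a list of rows.
    """
    kernel_rows = len(kernel)
    kernel_cols = len(kernel[0])
    out_rows = len(image) - kernel_rows + 1
    out_cols = len(image[0]) - kernel_cols + 1
    result = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            acc = 0
            # Accumulate pixel * coefficient over the kernel window.
            for s in range(kernel_rows):
                for t in range(kernel_cols):
                    acc += image[i + s][j + t] * kernel[s][t]
            row.append(acc)
        result.append(row)
    return result


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
box_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # 3 x 3 kernel of ones
print(convolve2d_valid(image, box_kernel))  # [[54, 63], [90, 99]]
```

Scanning the 3 × 3 kernel over the 4 × 4 image yields a 2 × 2 result, one value per kernel position.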

Feature values are generated by this convolution filter operation followed by a non-linear transformation, typified by the sigmoid transform. By hierarchically repeating this feature-generating operation on the input image, feature planes that express the features of the image are obtained. In other words, the two-dimensional feature values generated by repeating the convolution filter operation over the entire input image constitute a feature plane.

A typical deep net uses convolution filter operations to extract feature values from an image, and matrix product operations, typified by the perceptron, for classification using the extracted feature values. The feature extraction is often multi-layer processing that repeats the convolution filter operation many times, and the classification may likewise use a fully connected multi-layer perceptron. This configuration is very common among the deep nets that have been actively studied in recent years.

An example of deep net computation is described with reference to FIG. 4. FIG. 4 shows processing in which feature data is extracted from an input image 401, serving as the input layer, by convolution filter operations to obtain the feature values of a feature plane 407, and classification is then performed on those feature values to obtain an identification result 414. The convolution filter operation is repeated many times between the input image 401 and the feature plane 407. Fully connected perceptron processing is then applied to the feature values of the feature plane 407 multiple times to obtain the final identification result 414.

First, the convolution filter operations in the first half are described. In FIG. 4, the input image 401 represents raster-scanned image data of a predetermined size. Feature planes 403a to 403c are the feature planes of the first layer 408. As described above, a feature plane is a data plane holding the result of a predetermined feature extraction filter (a convolution filter operation and non-linear processing). Because it is the result of an operation on raster-scanned image data, the result is also represented as a plane. The feature planes 403a to 403c are generated by convolution filter operations and non-linear processing on the input image 401. For example, the feature plane 403a is obtained by a convolution filter operation using the filter kernel 4021a followed by a non-linear transformation of the result. The filter kernels 4021b and 4021c in FIG. 4 are the filter kernels used to generate the feature planes 403b and 403c, respectively. The structure formed by these convolution filter operations for generating each feature plane is called a hierarchical connection relationship.

Next, the operation that generates the feature plane 405a of the second layer 409 is described.

The feature plane 405a is connected to the three feature planes 403a to 403c of the preceding layer 408. Therefore, to calculate the data of the feature plane 405a, a convolution filter operation using the filter kernel 4041a is applied to the feature plane 403a and the result is held. Similarly, convolution filter operations using the filter kernels 4042a and 4043a are applied to the feature planes 403b and 403c, respectively, and those results are held. After these three filter operations finish, the held results are added together and a non-linear transformation is applied. Performing this processing over the entire image generates the feature plane 405a.
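The generation of one output feature plane from several reference planes, as described above, can be sketched as follows. This is a minimal illustration, not the patented circuit: the function names are hypothetical, the sigmoid is used as the non-linear transformation only because the text names it as a typical choice, and border handling (kernel applied only where it fits) is an assumption.

```python
import math


def convolve_valid(plane, kernel):
    """Product-sum of kernel and plane at every position where the kernel fits."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    return [
        [
            sum(plane[i + s][j + t] * kernel[s][t]
                for s in range(k_rows) for t in range(k_cols))
            for j in range(len(plane[0]) - k_cols + 1)
        ]
        for i in range(len(plane) - k_rows + 1)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def generate_feature_plane(reference_planes, kernels):
    """Convolve each reference plane with its own kernel, add the held
    results element-wise, then apply the non-linear transformation."""
    held = [convolve_valid(p, k) for p, k in zip(reference_planes, kernels)]
    rows, cols = len(held[0]), len(held[0][0])
    return [
        [sigmoid(sum(result[i][j] for result in held)) for j in range(cols)]
        for i in range(rows)
    ]


ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
zeros = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
minus_ones = [[-1, -1, -1], [-1, -1, -1], [-1, -1, -1]]

# Three reference planes (standing in for 403a to 403c) and three kernels
# (standing in for 4041a, 4042a and 4043a); the values are made up.
plane_405a = generate_feature_plane([ones, ones, zeros], [ones, minus_ones, ones])
print(plane_405a)  # [[0.5]] because the three partial results are 9, -9 and 0
```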

Similarly, the feature plane 405b is generated by three convolution filter operations that apply the filter kernels 4041b, 4042b, and 4043b to the feature planes 403a to 403c of the preceding layer 408. The feature plane 407 of the third layer 410 is generated by two convolution filter operations that apply the filter kernels 4061 and 4062 to the feature planes 405a and 405b of the preceding layer 409.

Next, the perceptron processing in the latter half is described. FIG. 4 shows a two-layer perceptron. A perceptron applies a non-linear transformation to the weighted sum of the elements of its input feature values. Therefore, an intermediate processing result 413 is obtained by performing a matrix product operation on the feature values of the feature plane 407 and applying a non-linear transformation to the result. Repeating the same processing yields the final identification result 414.
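As a rough illustration of the two-layer perceptron described above, the sketch below applies a matrix product followed by a non-linear transformation twice. The function name, the weight values, and the choice of the sigmoid are assumptions made for the example; the text does not specify them, and no bias term is included because none is mentioned.

```python
import math


def perceptron_layer(features, weights):
    """One fully connected layer: each output is the sigmoid of the
    weighted sum of all input feature values (one weight row per output)."""
    outputs = []
    for weight_row in weights:
        weighted_sum = sum(w * x for w, x in zip(weight_row, features))
        outputs.append(1.0 / (1.0 + math.exp(-weighted_sum)))
    return outputs


# Flattened feature values standing in for feature plane 407 (made-up numbers).
features_407 = [1.0, -1.0]

hidden_weights = [[1.0, 1.0], [0.0, 0.0]]   # first layer: 2 inputs -> 2 outputs
output_weights = [[2.0, -2.0]]              # second layer: 2 inputs -> 1 output

intermediate_413 = perceptron_layer(features_407, hidden_weights)
result_414 = perceptron_layer(intermediate_413, output_weights)
print(intermediate_413, result_414)  # [0.5, 0.5] [0.5]
```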

The hierarchical convolution filter operations in the convolution layers are based on a connection relationship in which many feature planes on the output side are calculated by referring to many feature planes on the reference side, so the amount of computation is large and the processing time is long. As shown in Non-Patent Document 1, the number of convolution filter operations and filter kernels can be reduced by making the connections between layers sparse.

As the hierarchical structure of deep nets grows in scale, efficient processing is needed to improve or maintain performance, including parallel processing in which multiple arithmetic circuits calculate multiple feature planes in parallel.

However, there are many connections between the feature planes of adjacent layers. If the feature planes to be calculated in parallel are simply divided and assigned to the arithmetic circuits, many reference-side feature planes and filter kernels end up being held redundantly in the memories of the individual arithmetic circuits. This redundantly held data lengthens memory access time and hinders the high-speed processing that parallelization across multiple arithmetic circuits is meant to provide.

To solve this problem, Patent Document 1 proposes a structure in which the connection relationship between layers is divided in advance, learning is performed independently for each divided connection relationship, and the calculation of the feature planes with independent connection relationships is then assigned to multiple arithmetic circuits for parallel processing.

Patent Document 1: WO2014105865A

Non-Patent Document 1: Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, 86(11): 2278-2324, November 1998

However, the method of Patent Document 1 presupposes that the structure of the hierarchical connection relationship is fixed before learning. This constrains machine learning, and the method cannot be applied to a connection relationship learned by the conventional method of Non-Patent Document 1. The reduced freedom in learning can therefore affect the accuracy of image classification and object detection. In other words, the structure of Patent Document 1, which is designed in advance with parallelization in mind, cannot be applied to connection configurations such as the sparse hierarchical connection relationship shown in Non-Patent Document 1, so parallel processing by multiple arithmetic circuits cannot be executed at high speed for such configurations.

The present invention has been made in view of the above problems. Its object is to provide an arithmetic processing unit that performs parallel processing at high speed by appropriately assigning the operations of a predetermined layer to multiple arithmetic circuits, even when the structure of the hierarchical connection relationship has not been determined in advance. A further object is to provide a corresponding arithmetic processing method and program.

To solve the above problems, an arithmetic processing unit according to the present invention has the following configuration. That is, it comprises:
a plurality of arithmetic circuits that perform hierarchical arithmetic processing in parallel; and
an allocation means that assigns, to the plurality of arithmetic circuits, each of a plurality of operations in a predetermined layer of the arithmetic processing, each of the operations referring to at least some of a plurality of operation results in a layer preceding the predetermined layer,
wherein the allocation means
calculates, for each allocation candidate, an evaluation value according to the amount of data duplicated among the plurality of arithmetic circuits, the data being the operation results in the preceding layer that are referred to during the operations, and
determines, as the allocation of the plurality of operations to the plurality of arithmetic circuits, a second allocation candidate whose evaluation value satisfies a predetermined criterion and for which the duplication, among the plurality of arithmetic circuits, of the referenced operation results in the preceding layer is smaller than that of at least a first allocation candidate.
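To make the role of the evaluation value more concrete, the following sketch scores allocation candidates by how many preceding-layer results would be held by more than one circuit and picks the candidate with the lowest score. Everything here is an assumption for illustration: the function name, the toy connection table, counting planes rather than bytes, and using "lowest score" as the predetermined criterion.

```python
def duplicated_reference_count(candidate, references):
    """Evaluation value for one allocation candidate.

    candidate  : list of groups, one per arithmetic circuit, each a set of
                 operations (output ids) assigned to that circuit
    references : dict mapping an operation to the set of preceding-layer
                 results it refers to
    Returns the number of extra copies of preceding-layer results that
    must be held because several circuits need the same result.
    """
    held_per_circuit = [
        set().union(*(references[op] for op in group)) for group in candidate
    ]
    total_held = sum(len(held) for held in held_per_circuit)
    distinct = len(set().union(*held_per_circuit))
    return total_held - distinct


# Toy connection table: operation -> preceding-layer results it refers to.
references = {"A": {1, 2}, "B": {2, 3}, "C": {1}, "D": {3}}

first_candidate = [{"A", "B"}, {"C", "D"}]    # circuits hold {1,2,3} and {1,3}
second_candidate = [{"A", "C"}, {"B", "D"}]   # circuits hold {1,2} and {2,3}

candidates = [first_candidate, second_candidate]
scores = [duplicated_reference_count(c, references) for c in candidates]
best = min(candidates, key=lambda c: duplicated_reference_count(c, references))
print(scores)                    # [2, 1]
print(best is second_candidate)  # True
```

A real evaluation value would presumably also weigh the size of each feature plane and the duplicated filter kernels; those refinements are left out here.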

According to the present invention, it is possible to provide an arithmetic processing unit that performs parallel processing at high speed by appropriately assigning the operations of a predetermined layer of hierarchical arithmetic processing to a plurality of arithmetic circuits.

FIG. 1: Hardware configuration of the arithmetic processing unit of the first embodiment.
FIG. 2: (a) Schematic configuration of an arithmetic circuit. (b) Configuration of the control unit of the arithmetic circuit.
FIG. 3: Schematic diagram of a convolution filter operation.
FIG. 4: Schematic diagram of the relationship between successive deep net layers.
FIG. 5: One example of dividing a hierarchical connection relationship.
FIG. 6: Another example of dividing a hierarchical connection relationship.
FIG. 7: Flowchart explaining the operation of the arithmetic processing unit of the first embodiment.
FIG. 8: Flowchart of the overall processing based on the inter-layer division method.
FIG. 9: Flowchart showing the inter-layer division method.
FIG. 10: Diagram explaining the exchange of feature plane allocations between layers.
FIG. 11: Diagram explaining the exchange of feature plane allocations between layers.
FIG. 12: Schematic diagram showing graph generation in the second embodiment.
FIG. 13: Overall flowchart based on the graph cut of the second embodiment.
FIG. 14: Schematic diagram showing dynamic variation of the hierarchical connections in the fourth embodiment.
FIG. 15: Configuration of the cloud server system of the fifth embodiment.

(First Embodiment)
The object of this embodiment is to optimize the division of the hierarchical connection relationship so that feature planes can be processed in parallel efficiently and at high speed, according to the configuration of that connection relationship. The details of this embodiment are described below with reference to the drawings.

FIG. 1 shows a configuration example of the arithmetic processing unit of this embodiment. This arithmetic processing unit has a pattern recognition function that detects and recognizes a specific object from input image data.

The arithmetic processing unit includes an image input module 1000, arithmetic circuits 1002-1 to 1002-n, RAMs 1001-1 to 1001-n, an I/F 1011, a RAM 1009, a DMAC 1006, a CPU 1007, a ROM 1008, and the like. The image input module 1000 includes an optical system, a photoelectric conversion device such as a CCD or CMOS sensor, a driver circuit that controls the sensor, an AD converter, a signal processing circuit that handles various image corrections, a frame buffer, and the like.

The RAMs (Random Access Memory) 1001-1 to 1001-n are used as working buffers for the operations of the arithmetic circuits 1002-1 to 1002-n, respectively.

The parallel arithmetic processing unit of this embodiment has a plurality of arithmetic circuits 1002-1 to 1002-n and a plurality of RAMs 1001-1 to 1001-n for performing parallel processing.

As shown in the hierarchical connection relationship of FIG. 4, the reference-side feature planes and the filter kernels needed to calculate the output-side feature planes are first stored in the RAM 1009 from outside the arithmetic processing unit through the I/F 1011. An allocation module 1012 creates division information for the hierarchical connection relationship. This information is used to split the data on the hierarchical connection relationship held in the RAM 1009 and store it in the RAMs 1001-1 to 1001-n, which are the memories of the arithmetic circuits 1002-1 to 1002-n. Here, the hierarchical connection relationship includes the structure of the connections between feature planes across layers, the convolution filter coefficients required for the convolution filter operation of each connection, and information on the feature planes.

The arithmetic circuits 1002-1 to 1002-n are CNN processing units that process the convolution filter operations assigned to them based on the division information of the hierarchical connection relationship according to this embodiment.

The DMAC (Direct Memory Access Controller) 1006 handles data transfer between each processing module on the image bus 1003 and the CPU bus 1010. The bridge 1004 provides a bridge function between the image bus 1003 and the CPU bus 1010. The preprocessing module 1005 performs various kinds of preprocessing so that pattern recognition by CNN processing can be carried out effectively. Specifically, it performs image data conversion such as color conversion and contrast correction in hardware. The preprocessing module 1005 does not have to be included in the arithmetic processing unit of this embodiment. The CPU 1007 controls the operation of the entire apparatus. The ROM (Read Only Memory) 1008 stores the instructions and parameter data that define the operation of the CPU 1007. The RAM 1009 is the memory required for the operation of the CPU 1007. The CPU 1007 can also access the RAM 1001 on the image bus 1003 via the bridge 1004. The division information of the hierarchical connection relationship does not have to be generated by the allocation module 1012; it may instead be acquired from an external device through the I/F 1011. For example, when the division information is generated on an external PC and stored in the RAM 1009 through the I/F 1011, parallel processing can be performed using the division information held in the RAM 1009.

Because the arithmetic circuits 1002-1 to 1002-n of the arithmetic processing unit of this embodiment all have the same configuration and operation, the internal configuration and operation of a representative arithmetic circuit 1002 are described with reference to FIG. 2(a). The RAMs 1001-1 to 1001-n, the memories corresponding to the arithmetic circuits 1002-1 to 1002-n, are likewise identical, so RAM 1001 denotes one of them. Based on the division information, the control unit 601 stores the required reference-side feature plane data and convolution filter coefficients in the RAM 1001. The control unit 601 reads the reference-side feature plane data and convolution filter coefficients held in the RAM 1001 and supplies them to the convolution operation unit 602. The convolution operation unit 602 performs the convolution operation of equation (1) on the reference-side feature plane and the convolution filter coefficients and outputs the result. The control unit 601 outputs this result to the RAM 1009.

FIG. 2(b) illustrates the detailed configuration of the control unit 601 of FIG. 2(a). The sequence control unit 1201 inputs and outputs various control signals 1204 that control the operation of the arithmetic circuit according to the information set in the register group 1202. Similarly, the sequence control unit 1201 generates a control signal 1206 for controlling the memory control unit 1205. The sequence control unit 1201 is a sequencer built from binary counters, Johnson counters, and the like. The register group 1202 consists of multiple register sets, which record, for example, information on the reference-side feature planes and the feature planes to be calculated, information on the kernels, and information on the feature planes held as a result of dividing the layer. Predetermined values are written to the register group 1202 in advance by the CPU 1007 via the bridge 1004 and the image bus 1003.

Based on the control signal 1206 from the sequence control unit 1201, the memory control unit 1205 supplies reference-side feature plane data 1207 and convolution filter coefficient data 1208 from the RAM to the convolution operation unit 602, and acquires the operation result 1209 from the convolution operation unit 602. The reference-side feature plane data and convolution filter coefficient data supplied here are determined by the division information of the hierarchical connection relationship.

FIG. 5 shows an example of the initial division of the connection relationship between a reference-side layer 123 and an output-side layer 124. Arrows 113 to 122 represent the connections, each a convolution filter operation with a different filter kernel, between feature planes: the feature planes 101 to 106 in the layer 123, which serve as reference-side feature planes, are connected to the feature planes 107 to 112 in the layer 124 by the arrows 113 to 122. Each arrow indicates both a connection through a filter kernel and the convolution filter operation performed with that kernel. For example, the arrow 113 indicates that the output-side feature plane 107 is generated from the reference-side feature plane 101 by a convolution filter operation, and that this operation uses the filter kernel 113.

Viewed from the output-side layer 124, the feature plane 107 is generated by the filter kernel 113 with the feature plane 101 as its reference-side feature plane, and the feature plane 108 is generated by the filter kernel 115 with the feature plane 102 as its reference-side feature plane. The feature plane 109 is generated by the filter kernels 116 and 119 with the feature planes 103 and 105, respectively, as reference-side feature planes, and the feature plane 110 is generated by the filter kernels 114 and 121 with the feature planes 101 and 106, respectively. The feature plane 111 is generated by the filter kernels 117 and 118 with the feature planes 102 and 104, respectively, and the feature plane 112 is generated by the filter kernels 120 and 122 with the feature planes 103 and 106, respectively. For one arithmetic circuit to calculate the feature planes 107, 108, and 109 in sequence, the feature planes 101, 102, 103, and 105 and the filter kernels corresponding to each of them must be read in sequence from that circuit's memory.
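The connections listed above can be written down as a small table, from which the data a circuit has to read follows directly. The sketch below is only a convenient way to check the enumeration; the names `CONNECTIONS` and `data_to_read` are hypothetical and the dictionary layout is an assumption, not a structure described in the text.

```python
# Output-side feature plane -> list of (reference-side plane, filter kernel),
# transcribed from the description of FIG. 5.
CONNECTIONS = {
    107: [(101, 113)],
    108: [(102, 115)],
    109: [(103, 116), (105, 119)],
    110: [(101, 114), (106, 121)],
    111: [(102, 117), (104, 118)],
    112: [(103, 120), (106, 122)],
}


def data_to_read(output_planes):
    """Reference planes and filter kernels a circuit needs in its memory
    in order to calculate the given output-side feature planes."""
    planes = sorted({ref for out in output_planes for ref, _ in CONNECTIONS[out]})
    kernels = sorted(k for out in output_planes for _, k in CONNECTIONS[out])
    return planes, kernels


print(data_to_read([107, 108, 109]))
# ([101, 102, 103, 105], [113, 115, 116, 119])
```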

Next, the outline and the result of the initial division of the hierarchical connection relationship for parallelization are described with reference to FIG. 5. The initial division is executed by the allocation module 1012 based on division conditions. The division conditions include the number of divisions, specified according to the number of available arithmetic circuits, and the maximum amount of data that can be held in the memory of an arithmetic circuit. In addition, to distribute the processing load, the allocation module 1012 performs the initial division so that the difference in the amount of computation between the arithmetic circuits is small, based on, for example, the number of feature planes to be calculated and the number of feature planes to be referenced. The initial division by the allocation module 1012 yields one candidate for the feature planes assigned to each arithmetic circuit, that is, the output-side feature planes that each circuit is to calculate.
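The text does not spell out how the initial division balances the load, so the following is only one possible heuristic, swapped in for illustration: each output plane goes to the circuit with the smallest load so far, where load is counted as the number of convolution filter operations. The function name and this load measure are assumptions, the memory limit is not modeled, and the result is not the same grouping as the one drawn in FIG. 5.

```python
def initial_division(operation_counts, circuit_count):
    """Greedy load balancing (an assumed heuristic, not the patented method).

    operation_counts : dict mapping an output-side feature plane to the
                       number of convolution filter operations it needs
    circuit_count    : number of available arithmetic circuits
    Returns (groups, loads): the planes assigned to each circuit and the
    resulting number of operations per circuit.
    """
    groups = [[] for _ in range(circuit_count)]
    loads = [0] * circuit_count
    for plane, count in operation_counts.items():
        # Pick the least-loaded circuit; ties go to the lowest index.
        target = min(range(circuit_count), key=lambda c: loads[c])
        groups[target].append(plane)
        loads[target] += count
    return groups, loads


# Number of incoming connections per output plane in the FIG. 5 example.
fig5_counts = {107: 1, 108: 1, 109: 2, 110: 2, 111: 2, 112: 2}
groups, loads = initial_division(fig5_counts, 2)
print(groups)  # [[107, 109, 111], [108, 110, 112]]
print(loads)   # [5, 5]
```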

FIG. 5 shows the result of the initial division of the connection relationship between the layers, in which the feature planes to be generated are split into a division 125 and a division 126. In the division 125, the feature planes 107, 108, and 109 form one unit, and the convolution filter operations for calculating them are assigned to the arithmetic circuit 1002-1. In the division 126, the feature planes 110, 111, and 112 form one unit, and the convolution filter operations for calculating them are assigned to the arithmetic circuit 1002-2.

ここで、それぞれの特徴面を生成するために必要な参照側の特徴面およびフィルタ係数は、割当てられた演算回路１００２のＲＡＭ１００１で保持して動作することを想定している。例えば、図５の分割１２５で示す特徴面１０７、１０８、１０９を生成するための参照側の特徴面１０１、１０２、１０３、１０５と、矢印１１３、１１５、１１６、１１９のそれぞれで示す畳込みフィルタ係数はＲＡＭ１００１−１に保持されている。また、分割１２６で示す特徴面１１０、１１１、１１２を生成するための参照側の特徴面１０１、１０２、１０３、１０６と、矢印１１４、１１７、１２０、１１８、１２２のそれぞれで示す畳込みフィルタ係数はＲＡＭ１００１−２に保持されている。 Here, it is assumed that the reference-side feature planes and filter coefficients required to generate each feature plane are held in the RAM 1001 of the assigned arithmetic circuit 1002. For example, the reference-side feature planes 101, 102, 103, and 105 for generating the feature planes 107, 108, and 109 shown in the division 125 of FIG. 5, and the convolution filter coefficients indicated by the arrows 113, 115, 116, and 119, are held in RAM 1001-1. Likewise, the reference-side feature planes 101, 102, 103, and 106 for generating the feature planes 110, 111, and 112 shown in the division 126, and the convolution filter coefficients indicated by the arrows 114, 117, 120, 118, and 122, are held in RAM 1001-2.

ここで、保持されている保持データについて注目すると、参照側の特徴面１０１、１０２、１０３はそれぞれの割当て先の演算回路のＲＡＭ１００１で重複して保持する必要があるため、並列演算に必要なメモリ領域及びデータの転送量が増大する。 Looking at the held data, the reference-side feature planes 101, 102, and 103 must be held in duplicate in the RAM 1001 of each assigned arithmetic circuit, so the memory area and the amount of data transfer required for the parallel operation both increase.

次に、図５と同じ階層的な結合関係に対して、図５と異なる分割結果の一例を図６に示す。図６に示す分割結果は、本実施形態の割当てモジュール１０１２の処理によって、それぞれの演算回路が算出する出力側の特徴面として、それぞれの演算回路に割当てられる特徴面のもう一つの候補である。割当てモジュール１０１２は、図５に示す候補と図６に示す候補から、割当てを選択する。割当ての選択については後に述べるが、ここでは、まず、図６に示す分割結果について説明する。 Next, FIG. 6 shows an example of the division result different from that of FIG. 5 with respect to the same hierarchical connection relationship as in FIG. The division result shown in FIG. 6 is another candidate for the characteristic surface assigned to each arithmetic circuit as the characteristic surface on the output side calculated by each arithmetic circuit by the processing of the allocation module 1012 of the present embodiment. The allocation module 1012 selects an allocation from the candidates shown in FIG. 5 and the candidates shown in FIG. The selection of allocation will be described later, but here, first, the division result shown in FIG. 6 will be described.

図６に示す分割結果は、分割２０１と分割２０２である。分割２０１に示す特徴面１０７、１０８、１１１を生成するための演算処理を演算回路１００２−１に割当てており、分割２０２で示す特徴面１０９、１１０、１１２を生成するための演算処理を演算回路１００２−２に割当てている。 The division results shown in FIG. 6 are division 201 and division 202. The arithmetic processing for generating the characteristic surfaces 107, 108, 111 shown in the division 201 is assigned to the arithmetic circuit 1002-1, and the arithmetic processing for generating the characteristic surfaces 109, 110, 112 shown in the division 202 is assigned to the arithmetic circuit. It is assigned to 1002-2.

分割２０１で示す特徴面１０７、１０８、１１１を生成するための参照側の特徴面１０１、１０２、１０４と、矢印１１３、１１５、１１７、１１８のそれぞれで示す畳込みフィルタ係数のそれぞれをＲＡＭ１００１−１で保持する。また、分割２０２に示す特徴面１０９、１１０、１１２を生成するための参照側の特徴面１０１、１０３、１０５、１０６と、矢印１１４、１１６、１１９〜１２２のそれぞれで示す畳込みフィルタ係数のそれぞれをＲＡＭ１００１−２で保持する。 The reference-side feature planes 101, 102, and 104 for generating the feature planes 107, 108, and 111 shown in the division 201, and the convolution filter coefficients indicated by the arrows 113, 115, 117, and 118, are held in RAM 1001-1. Likewise, the reference-side feature planes 101, 103, 105, and 106 for generating the feature planes 109, 110, and 112 shown in the division 202, and the convolution filter coefficients indicated by the arrows 114, 116, and 119 to 122, are held in RAM 1001-2.

ここで、ＲＡＭ１００１−１及びＲＡＭ１００１−２の保持データについて注目すると、参照側の特徴面１０１のみ、重複して保持する必要がある。 Here, paying attention to the retained data of the RAM 1001-1 and the RAM 1001-2, it is necessary to duplicately retain only the characteristic surface 101 on the reference side.

図５の分割１２５、１２６と図６の分割２０１、２０２とを比べると、出力側の階層１２４の特徴面を算出するための処理内容・演算量は同じにもかかわらず、ＲＡＭ１００１−１及びＲＡＭ１００１−２で重複して保持する特徴面の数が異なってくる。このように、並列処理を行うための割り当て単位次第で、メモリに保持すべきデータの重複量が変わってくるので、並列処理の効率が大きく左右される。 Comparing the divisions 125 and 126 of FIG. 5 with the divisions 201 and 202 of FIG. 6, the processing content and the amount of calculation for computing the feature planes of the output-side layer 124 are the same, yet the number of feature planes held in duplicate in RAM 1001-1 and RAM 1001-2 differs. The amount of duplicated data to be held in memory thus depends on the allocation unit used for parallel processing, and this strongly affects the efficiency of the parallel processing.
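The comparison between FIG. 5 and FIG. 6 can be checked with a short sketch. The connection table below is an assumption read off the per-plane description given for FIG. 10 in this section (which reference-side planes each output-side plane needs); the function names are illustrative and do not appear in the patent.

```python
# Output-side plane -> reference-side planes it needs (as described for FIG. 10).
CONNECTIONS = {
    107: {101},
    108: {102},
    109: {103, 105},
    110: {101, 106},
    111: {102, 104},
    112: {103, 106},
}

def required_refs(outputs):
    """Reference planes one circuit must hold to compute the given output planes."""
    refs = set()
    for out in outputs:
        refs |= CONNECTIONS[out]
    return refs

def duplicated_refs(partition):
    """Reference planes that end up in the RAM of more than one circuit."""
    seen, dup = set(), set()
    for outputs in partition:
        refs = required_refs(outputs)
        dup |= seen & refs
        seen |= refs
    return dup

fig5 = [{107, 108, 109}, {110, 111, 112}]  # divisions 125 and 126
fig6 = [{107, 108, 111}, {109, 110, 112}]  # divisions 201 and 202

print(sorted(duplicated_refs(fig5)))  # [101, 102, 103]
print(sorted(duplicated_refs(fig6)))  # [101]
```

Both partitions compute the same six output planes; only the grouping differs, which is what changes the duplicated set.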

図７は、本実施形態の並列演算装置がパターン認識を行うための動作を説明するフローチャートである。以下、フローチャートは、ＣＰＵ１００７が制御プログラムを実行することにより実現されるものとする。 FIG. 7 is a flowchart illustrating an operation for the parallel arithmetic unit of the present embodiment to perform pattern recognition. Hereinafter, it is assumed that the flowchart is realized by the CPU 1007 executing the control program.

ステップＳ１１０１では、認識処理の開始に先立ち、ＣＰＵ１００７が各種初期化処理を実行する。ＣＰＵ１００７は、演算回路のＣＮＮ処理動作に必要なフィルタ係数をＲＯＭ１００８からＲＡＭ１００１に転送すると共に、演算回路１００２の動作、即ち階層的な結合関係を定義する為の各種レジスタ設定を行う。具体的には演算回路１００２の制御部６０１に存在する複数のレジスタに所定の値を設定する。同様に、ＣＰＵ１００７は、前処理モジュール１００５等のレジスタに対しても動作に必要な値を書き込む。 In step S1101, the CPU 1007 executes various initialization processes prior to the start of the recognition process. The CPU 1007 transfers the filter coefficient required for the CNN processing operation of the arithmetic circuit from the ROM 1008 to the RAM 1001, and sets various registers for defining the operation of the arithmetic circuit 1002, that is, the hierarchical coupling relationship. Specifically, predetermined values are set in a plurality of registers existing in the control unit 601 of the arithmetic circuit 1002. Similarly, the CPU 1007 writes a value necessary for operation to a register such as the preprocessing module 1005.

次に、ステップＳ１１０２で割当てモジュール１０１２は、各特徴面を算出する際の階層構造の分割を決定し、階層的な結合関係の分割情報を生成する。ここでは、並列に動作する演算回路の数等の条件に従って階層構造の分割を決定するが、具体的な分割手法は後述する。 Next, in step S1102, the allocation module 1012 determines the division of the hierarchical structure when calculating each feature surface, and generates the division information of the hierarchical connection relationship. Here, the division of the hierarchical structure is determined according to conditions such as the number of arithmetic circuits operating in parallel, and a specific division method will be described later.

初期化処理を行うステップＳ１１０１及び階層構造の分割を行うステップＳ１１０２の後に、一連の物体認識動作を開始する。まず、ステップＳ１１０３では画像入力モジュール１０００が、画像センサーの出力する信号を画像データに変換し、フレーム単位で図示しないが、画像入力モジュール１０００に内蔵するフレームバッファに格納する。フレームバッファへの画像データの格納が完了すると、所定の信号に基づいて、前処理モジュール１００５が画像変換処理を開始する。ステップＳ１１０４では、前処理モジュール１００５は前記フレームバッファ上の画像データから輝度データを抽出し、コントラスト補正処理を行う。 After step S1101, which performs the initialization processing, and step S1102, which divides the hierarchical structure, a series of object recognition operations is started. First, in step S1103, the image input module 1000 converts the signal output by the image sensor into image data and stores it, frame by frame, in a frame buffer (not shown) built into the image input module 1000. When the image data has been stored in the frame buffer, the preprocessing module 1005 starts the image conversion processing in response to a predetermined signal. In step S1104, the preprocessing module 1005 extracts luminance data from the image data in the frame buffer and performs contrast correction processing.

輝度データの抽出は一般的な線形変換処理によりＲＧＢ画像データから輝度データを生成する。コントラスト補正の手法も一般的に知られているコントラスト補正処理を適用してコントラストを強調する。前処理モジュール１００５は階層的な結合関係の分割情報に基づき、コントラスト補正処理後の輝度データを検出用画像として、並列処理が振り分けられた演算回路１００２に対応するＲＡＭ１００１に格納する。１フレームの画像データに対して前処理が完了すると、前処理モジュール１００５は図示しない完了信号を有効にする。 The luminance data is extracted by generating the luminance data from the RGB image data by a general linear conversion process. The contrast correction method also applies a generally known contrast correction process to enhance the contrast. The preprocessing module 1005 stores the luminance data after the contrast correction processing as a detection image in the RAM 1001 corresponding to the arithmetic circuit 1002 to which the parallel processing is distributed, based on the division information of the hierarchical connection relationship. When the preprocessing for one frame of image data is completed, the preprocessing module 1005 enables a completion signal (not shown).

ステップＳ１１０５では、演算回路１００２は前処理モジュール１００５が有効にした完了信号に基づいて起動し、ＣＮＮ処理に基づく物体の検出処理を開始する。ステップＳ１１０５での処理はステップＳ１１０２で生成された階層な結合関係の分割情報に基づいて動作である。ステップＳ１１０６では、最終層の特徴面の算出を終了すると演算回路１００２はＣＰＵ１００７に対して完了割り込みを発生する。 In step S1105, the arithmetic circuit 1002 is activated by the completion signal asserted by the preprocessing module 1005 and starts the object detection processing based on the CNN processing. The processing in step S1105 operates according to the division information of the hierarchical connection relationship generated in step S1102. In step S1106, when the calculation of the feature planes of the final layer is finished, the arithmetic circuit 1002 issues a completion interrupt to the CPU 1007.

ステップＳ１１０７では、ＣＰＵ１００７は演算回路１００２の処理終了を示す完了割り込みを受信すると、最終層の特徴面を解析し、画像中の物体の位置や属性を判定する。ステップＳ１１０７の解析処理を完了すると、ステップＳ１１０８に進み、次のフレームの画像に対する処理が継続する各特徴面を算出する際の階層構造の分割を決定し、階層的な結合関係の分割情報を生成する。 In step S1107, upon receiving the completion interrupt indicating that the arithmetic circuit 1002 has finished processing, the CPU 1007 analyzes the feature planes of the final layer and determines the position and attributes of objects in the image. When the analysis processing of step S1107 is completed, the process proceeds to step S1108, where, as processing continues with the image of the next frame, the division of the hierarchical structure used when calculating each feature plane is determined and the division information of the hierarchical connection relationship is generated.

次に、図８を用いて、ステップＳ１１０２で割当てモジュール１０１２が、ＣＮＮ処理の階層構造を分割して、複数の特徴面の算出を複数の演算回路に並列処理させるために、階層的な結合関係の分割情報を生成する方法について述べる。本実施形態では焼きなまし法（シミュレーテッドアニーリング法）に基づく算出手法について説明する。まず、ステップＳ８０１で分割条件を取得する。ここでいう分割条件とは、使用可能な演算回路の個数に基づく分割数の指定や演算回路当たりで保持ないしは転送する必要のあるデータ量に基づく条件などである。次に、ステップＳ８０２で階層的な結合関係から任意の階層間を選択する。つまり直接結合されている入力層と出力層の組を選択する。選択した階層間の結合関係の分割をステップＳ８０３で決定する。そして、ステップＳ８０４では、すべての階層間で終了するまでステップＳ８０３までの処理を繰り返し行う。 Next, with reference to FIG. 8, the method by which the allocation module 1012 generates the division information of the hierarchical connection relationship in step S1102 will be described. This information divides the hierarchical structure of the CNN processing so that the calculation of a plurality of feature planes is processed in parallel by a plurality of arithmetic circuits. This embodiment describes a calculation method based on simulated annealing. First, the division conditions are acquired in step S801. The division conditions here include the number of divisions, specified from the number of usable arithmetic circuits, and conditions based on the amount of data that must be held or transferred per arithmetic circuit. Next, in step S802, an arbitrary pair of adjacent layers is selected from the hierarchical connection relationship, that is, a pair consisting of a directly connected input layer and output layer. The division of the connection relationship between the selected layers is determined in step S803. Then, in step S804, the processing up to step S803 is repeated until all layer pairs have been processed.

図９は、ステップ８０３での選択した階層間の具体的な分割決定のプロセスを示すフローチャートである。 FIG. 9 is a flowchart showing a specific process of determining the division between the selected hierarchies in step 803.

まず、ステップＳ９０１では、ＣＮＮ処理における複数の畳込みフィルタ演算を割り当てることが可能な演算回路の数に基づいて、図５に示すように、割当てモジュール１０１２が初期の階層間の結合分割を決定する。次に、初期の階層的な結合関係の分割を解析する。その結果、図５に示す階層間の結合関係１２７に対して、図１０に示すように、割当てモジュール１０１２が出力側の特徴面１０７〜１１２の算出を、演算回路１への割当て１３０６および演算回路２への割り当て１３１５のように割り当てていることが分かった。この際、割当て１３０６に関して出力側の特徴面１０７〜１０９を算出するために必要なデータ１３０１は、演算回路１に対応する非図示のＲＡＭに保持する必要がある。具体的に、出力側の特徴面１０７を算出するために、参照側の特徴面１０１と畳込みフィルタカーネル１３０２、出力側の特徴面１０８を算出するために、参照側の特徴面１０２と畳込みフィルタカーネル１３０３が演算回路１のＲＡＭに保持される。また、出力側の特徴面１０９を算出するために、参照側の特徴面１０３と畳込みフィルタカーネル１３０４、参照側の特徴面１０５と畳込みフィルタカーネル１３０５が演算回路１のＲＡＭに保持される。 First, in step S901, the allocation module 1012 determines the initial division of the connections between layers, as shown in FIG. 5, based on the number of arithmetic circuits to which the plurality of convolution filter operations in the CNN processing can be assigned. Next, the initial division of the hierarchical connection relationship is analyzed. The analysis shows that, for the connection relationship 127 between the layers shown in FIG. 5, the allocation module 1012 has assigned the calculation of the output-side feature planes 107 to 112 as the allocation 1306 to arithmetic circuit 1 and the allocation 1315 to arithmetic circuit 2, as shown in FIG. 10. In this case, the data 1301 required to calculate the output-side feature planes 107 to 109 for the allocation 1306 must be held in a RAM (not shown) corresponding to arithmetic circuit 1. Specifically, the reference-side feature plane 101 and the convolution filter kernel 1302 are held in the RAM of arithmetic circuit 1 to calculate the output-side feature plane 107, and the reference-side feature plane 102 and the convolution filter kernel 1303 are held there to calculate the output-side feature plane 108. Further, to calculate the output-side feature plane 109, the reference-side feature plane 103 and the convolution filter kernel 1304, and the reference-side feature plane 105 and the convolution filter kernel 1305, are held in the RAM of arithmetic circuit 1.

また、演算回路２への割当て１３１５に関しては、出力側の特徴面１１０〜１１２を算出するために必要なデータ１３０８は、演算回路２に対応する非図示のＲＡＭに保持する必要がある。具体的に、出力側の特徴面１１０を算出するために、参照側の特徴面１０１と畳込みフィルタカーネル１３０９、参照側の特徴面１０６と畳込みフィルタカーネル１３１３が演算回路２のＲＡＭに保持される。同様に、出力側の特徴面１１１を算出するために、参照側の特徴面１０２と畳込みフィルタカーネル１３１０、参照側の特徴面１０４と畳込みフィルタカーネル１３１２が演算回路２のＲＡＭに保持される。また、出力側の特徴面１１２を算出するために、参照側の特徴面１０３と畳込みフィルタカーネル１３１１、参照側の特徴面１０６と畳込みフィルタカーネル１３１４が演算回路２のＲＡＭに保持される。 Regarding the allocation 1315 to arithmetic circuit 2, the data 1308 required to calculate the output-side feature planes 110 to 112 must be held in a RAM (not shown) corresponding to arithmetic circuit 2. Specifically, to calculate the output-side feature plane 110, the reference-side feature plane 101 and the convolution filter kernel 1309, and the reference-side feature plane 106 and the convolution filter kernel 1313, are held in the RAM of arithmetic circuit 2. Similarly, to calculate the output-side feature plane 111, the reference-side feature plane 102 and the convolution filter kernel 1310, and the reference-side feature plane 104 and the convolution filter kernel 1312, are held in the RAM of arithmetic circuit 2. Further, to calculate the output-side feature plane 112, the reference-side feature plane 103 and the convolution filter kernel 1311, and the reference-side feature plane 106 and the convolution filter kernel 1314, are held in the RAM of arithmetic circuit 2.

ここで、演算回路１への割当て１３０６と演算回路２への割当て１３１５によって、それぞれのＲＡＭで保持する必要のあるデータを見てみると、参照側の特徴面１０１、１０２、１０３のデータが重複して保持するデータとなっている。なお、図５の例では、各演算回路への割当てられた特徴面の個数がバランスよくなるようにするために同じであるが、これに限ったものではなく、各演算回路へ割当てた特徴面の個数が異なってもよい。 Here, looking at the data that needs to be held in the respective RAMs due to the allocation 1306 to the arithmetic circuit 1 and the allocation 1315 to the arithmetic circuit 2, the data of the feature planes 101, 102, and 103 on the reference side are duplicated. It is the data to be retained. In the example of FIG. 5, the number of characteristic surfaces assigned to each arithmetic circuit is the same so as to be well-balanced, but the present invention is not limited to this, and the characteristic surfaces assigned to each arithmetic circuit are not limited to this. The number may be different.

次に、ステップＳ９０２で、階層処理間の異なる演算回路に割り当てられている出力側の特徴面を二つピックアップする。例えば、図１０の演算回路１への割当て１３０６と、演算回路２への割当て１３１５とから、それぞれ出力側の特徴面１０９および特徴面１１１をピックアップし、ステップＳ９０３にて、ピックアップした特徴面割当てを交換する。特徴面割当ての交換は、演算回路１に割当てられた特徴面１０９の算出を演算回路２に割当て、演算回路２に割当てられた特徴面１１１の算出を演算回路１に割当てるように結合関係の分割情報を変更する処理である。 Next, in step S902, two characteristic surfaces on the output side assigned to different arithmetic circuits between the hierarchical processes are picked up. For example, the feature surface 109 and the feature surface 111 on the output side are picked up from the allocation 1306 to the arithmetic circuit 1 and the allocation 1315 to the arithmetic circuit 2 in FIG. 10, respectively, and the picked up feature surface allocation is performed in step S903. Exchange. In the exchange of feature surface allocation, the coupling relationship is divided so that the calculation of the feature surface 109 assigned to the calculation circuit 1 is assigned to the calculation circuit 2 and the calculation of the feature surface 111 assigned to the calculation circuit 2 is assigned to the calculation circuit 1. This is the process of changing information.

二つの演算回路に割当てられた出力側の特徴面を交換した後のそれぞれの演算回路のＲＡＭに保持するデータについて図１１を用いて説明する。交換後の演算回路１及び演算回路２への算出する出力側の特徴面の割当ては割当て１４０２及び割当て１４０５のようになる。交換後の結合関係を解析すると、割当て１４０２に関して、出力側の特徴面１０７、１０８、１１１を作成するために必要なデータはデータ１４０１である。即ち、出力側の特徴面１０７を算出するために、参照側の特徴面１０１と畳込みフィルタカーネル１３０２、出力側の特徴面１０８を算出するために、参照側の特徴面１０２と畳込みフィルタカーネル１３０３を演算回路１のＲＡＭに保持する必要がある。また、出力側の特徴面１１１を算出するために、参照側の特徴面１０２と畳込みフィルタカーネル１３１０、参照側の特徴面１０４と畳込みフィルタカーネル１３１２を演算回路１のＲＡＭに保持する必要がある。 The data held in the RAM of each arithmetic circuit after the output-side feature planes assigned to the two arithmetic circuits have been exchanged will be described with reference to FIG. 11. After the exchange, the output-side feature planes to be calculated are assigned to arithmetic circuit 1 and arithmetic circuit 2 as the allocation 1402 and the allocation 1405, respectively. Analyzing the connection relationship after the exchange shows that, for the allocation 1402, the data required to generate the output-side feature planes 107, 108, and 111 is the data 1401. That is, the reference-side feature plane 101 and the convolution filter kernel 1302 must be held in the RAM of arithmetic circuit 1 to calculate the output-side feature plane 107, and the reference-side feature plane 102 and the convolution filter kernel 1303 must be held there to calculate the output-side feature plane 108. Further, to calculate the output-side feature plane 111, the reference-side feature plane 102 and the convolution filter kernel 1310, and the reference-side feature plane 104 and the convolution filter kernel 1312, must be held in the RAM of arithmetic circuit 1.

また、割当て１４０５に関しては、出力側の特徴面１１０、１０９、１１２を生成するために必要なデータはデータ１４０４である。即ち、出力側の特徴面１１０を算出するために、参照側の特徴面１０１とフィルタカーネル１３０９、参照側の特徴面１０６とフィルタカーネル１３１３を演算回路２のＲＡＭに保持する必要がある。また、出力側の特徴面１０９を算出するために、参照側の特徴面１０３とフィルタカーネル１３０４、参照側の特徴面１０５とフィルタカーネル１３０５を演算回路２のＲＡＭに保持する必要がある。また、出力側の特徴面１１２を算出するために、参照側の特徴面１０３とフィルタカーネル１３１１、参照側の特徴面１０６とフィルタカーネル１３１４を演算回路２のＲＡＭに保持する必要がある。 Regarding the allocation 1405, the data required to generate the feature planes 110, 109, 112 on the output side is the data 1404. That is, in order to calculate the feature surface 110 on the output side, it is necessary to hold the feature surface 101 on the reference side and the filter kernel 1309, and the feature surface 106 on the reference side and the filter kernel 1313 in the RAM of the arithmetic circuit 2. Further, in order to calculate the feature surface 109 on the output side, it is necessary to hold the feature surface 103 on the reference side and the filter kernel 1304, and the feature surface 105 on the reference side and the filter kernel 1305 in the RAM of the arithmetic circuit 2. Further, in order to calculate the feature surface 112 on the output side, it is necessary to hold the feature surface 103 on the reference side and the filter kernel 1311, and the feature surface 106 on the reference side and the filter kernel 1314 in the RAM of the arithmetic circuit 2.

ここで、割当て１４０２と割当て１４０５でそれぞれの演算回路のＲＡＭで保持する必要のあるデータを見てみると、参照側の特徴面１０１のみが重複して保持するデータとなっている。図１０で示した割当て１３０６と割当て１３１５と比べると、ＲＡＭに保持する必要なデータ数を特徴面の数で比較すると、９面から７面へと少なくさせることが可能である。 Here, looking at the data that needs to be held in the RAMs of the respective arithmetic circuits in the allocation 1402 and the allocation 1405, only the characteristic surface 101 on the reference side is duplicated and retained. Compared with the allocation 1306 and the allocation 1315 shown in FIG. 10, when the number of data required to be held in the RAM is compared by the number of feature planes, it is possible to reduce the number from 9 planes to 7 planes.
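The reduction from 9 to 7 held feature planes can be reproduced with a minimal sketch of the exchange in steps S902 and S903. The connection table is an assumption taken from the per-plane description of FIG. 10 and FIG. 11 above; `held_planes` and `swap` are illustrative names, not terms from the patent.

```python
# Output-side plane -> reference-side planes it needs (as described for FIG. 10).
CONNECTIONS = {
    107: {101}, 108: {102}, 109: {103, 105},
    110: {101, 106}, 111: {102, 104}, 112: {103, 106},
}

def held_planes(partition):
    """Total number of reference planes stored across all circuit RAMs."""
    return sum(
        len(set().union(*(CONNECTIONS[o] for o in outs))) for outs in partition
    )

def swap(partition, a, b):
    """Return a new two-circuit partition with planes a and b exchanged."""
    first, second = (set(p) for p in partition)
    first.remove(a)
    second.remove(b)
    first.add(b)
    second.add(a)
    return [first, second]

before = [{107, 108, 109}, {110, 111, 112}]  # allocations 1306 and 1315
after = swap(before, 109, 111)               # allocations 1402 and 1405

print(held_planes(before), held_planes(after))  # 9 7
```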

次に、ステップＳ９０４にて評価値を算出する。ここで、評価値は、それぞれの演算回路のＲＡＭで重複して保持するデータのデータ量および、ペナルティ値で構成される。ペナルティ値は、ステップＳ８０１で取得した分割条件に基づき決定する。分割条件は、例えば、演算回路の個数や演算回路一つあたりで割当て（分割）可能な演算処理量である。演算回路の演算処理量は、参照側の特徴面、フィルタカーネルのサイズや演算サイクル数などの条件によって算出される。また、それぞれの演算回路のＲＡＭで重複して保持する特徴面の許容数、各演算回路の処理負荷分散条件などを考慮して分割条件を決定してもよい。本実施形態では、説明の簡単化のために、演算回路一つあたりで処理可能な演算処理量として参照側の特徴面サイズとフィルタカーネルのサイズの合計値を用いて分割条件例を説明する。 Next, the evaluation value is calculated in step S904. Here, the evaluation value is composed of a data amount of data duplicated and held in the RAM of each arithmetic circuit and a penalty value. The penalty value is determined based on the division condition acquired in step S801. The division condition is, for example, the number of arithmetic circuits or the amount of arithmetic processing that can be allocated (divided) per arithmetic circuit. The calculation processing amount of the calculation circuit is calculated based on conditions such as the characteristic surface of the reference side, the size of the filter kernel, and the number of calculation cycles. Further, the division condition may be determined in consideration of the allowable number of characteristic surfaces to be duplicated and held in the RAM of each arithmetic circuit, the processing load distribution condition of each arithmetic circuit, and the like. In the present embodiment, for the sake of simplification of the description, an example of the division condition will be described using the total value of the feature plane size on the reference side and the size of the filter kernel as the amount of arithmetic processing that can be processed per arithmetic circuit.

まず、評価値算出にあたり、分割数がａ個の時、割当てｌと割当てｋ間で重複して保持すべきデータ量をｎ（ｌ，ｋ）とすると、全割当てにおいて重複して保持すべきデータ量の合計ｓは First, to calculate the evaluation value, let a be the number of divisions and n(l, k) the amount of data that must be held in duplicate between allocation l and allocation k. The total amount s of data held in duplicate over all allocations can then be expressed as the sum of n(l, k) over all pairs of allocations:

$$s = \sum_{l=1}^{a-1} \sum_{k=l+1}^{a} n(l, k)$$

と表すことができる。

次に、ペナルティ値について説明する。割当てｉを処理するために必要な参照側の特徴面総サイズをｘ_ｉと、フィルタカーネルの総サイズをｗ_ｉとすると、演算回路一つあたりでの必要データサイズの合計ｔ_ｉはｔ_ｉ＝ｘ_ｉ＋ｗ_ｉと表すことができる。前述のとおり、本実施形態では演算回路一つあたりで処理可能な参照側の特徴面サイズとフィルタカーネルのサイズの合計値を分割条件として扱うため、その条件値をｔｈとすると、各ｔ_ｉがｔｈ以下か否かを比較し、ｔｈを超えた場合は各割当て毎のペナルティ値ｐ_ｉにＣを与える。 Next, the penalty value will be described. Let x_i be the total size of the reference-side feature planes required to process allocation i, and w_i the total size of the filter kernels. The total data size t_i required per arithmetic circuit can then be expressed as t_i = x_i + w_i. As described above, this embodiment uses as the division condition the total of the reference-side feature plane size and the filter kernel size that one arithmetic circuit can process. Letting th be that condition value, each t_i is compared with th, and when t_i exceeds th, the penalty value p_i for that allocation is set to C.

つまり、全割当てでのペナルティ値の合計は In other words, the total penalty value P over all allocations is

$$P = \sum_{i=1}^{a} p_i$$

以上より評価値ｆは以下のように表すことができる。 From the above, the evaluation value f can be expressed as follows.

$$f = s + P$$
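As an illustration of the evaluation value f = s + P, the following sketch computes it for the two partitions discussed above. The plane size, kernel size, threshold th, and penalty constant C are hypothetical values chosen only for the example, and treating a non-exceeding allocation as contributing zero penalty is an assumption. The connection table is the one read off the description of FIG. 10.

```python
from itertools import combinations

# Output-side plane -> reference-side planes it needs (as described for FIG. 10).
CONNECTIONS = {
    107: {101}, 108: {102}, 109: {103, 105},
    110: {101, 106}, 111: {102, 104}, 112: {103, 106},
}
PLANE_SIZE = 100   # hypothetical size of one feature plane
KERNEL_SIZE = 9    # hypothetical size of one 3x3 filter kernel
TH = 500           # hypothetical per-circuit limit th
C = 1000           # hypothetical penalty constant C

def refs_of(outputs):
    """Reference planes needed for a set of output planes."""
    return set().union(*(CONNECTIONS[o] for o in outputs))

def evaluation(partition):
    """Evaluation value f = s + P for a list of allocations."""
    refs = [refs_of(outs) for outs in partition]
    # s: duplicated data n(l, k), summed over every pair of allocations.
    s = sum(len(rl & rk) * PLANE_SIZE for rl, rk in combinations(refs, 2))
    # P: add C for each allocation whose t_i = x_i + w_i exceeds th.
    P = 0
    for outs, r in zip(partition, refs):
        x_i = len(r) * PLANE_SIZE
        w_i = sum(len(CONNECTIONS[o]) for o in outs) * KERNEL_SIZE
        if x_i + w_i > TH:
            P += C
    return s + P

print(evaluation([{107, 108, 109}, {110, 111, 112}]))  # 1300
print(evaluation([{107, 108, 111}, {109, 110, 112}]))  # 100
```

With these assumed numbers, the FIG. 5 partition is penalized because the second circuit needs 554 units of data, while the FIG. 6 partition stays under the limit on both circuits.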
ステップＳ９０５では、ステップＳ９０４で算出した評価値に基づいて、評価値が割当て前より良い場合、割当ての変更を採用し、そうでない場合は交換前の割当てに戻す。ここで、割当て変更の基準は、変更前後で評価値が良悪だけではなく、ｎ％以上良い場合なら変更するなどの幅を持った閾値で採用選択を行ってもよい。 In step S905, based on the evaluation value calculated in step S904, the change of allocation is adopted if the evaluation value is better than before the exchange; otherwise, the allocation is returned to that before the exchange. The criterion for adopting a change need not be a simple better-or-worse comparison of the evaluation values before and after the change. A threshold with a margin may be used instead, for example adopting the change only when the evaluation value improves by n% or more.

ステップＳ９０６で事前に設定した繰り返し回数や時間、それぞれの演算回路のＲＡＭで重複して保持する特徴面の数、特徴面の交換による重複データ量削減率などの制約条件を満たすまでステップＳ９０２からステップＳ９０５までの処理を繰り返し行う。例えば、それぞれの演算回路のＲＡＭで重複して保持する特徴面の数が所定数以下になるまで、ステップＳ９０２からステップＳ９０５までの処理を繰り返す。また、焼きなまし法に基づく場合は、その収束条件に従うことも可能である。また、本実施形態は、焼きなまし方に限らず、遺伝的アルゴリズムなどに代表される進化的アルゴリズムを持ちしても良い。 In step S906, the processing from step S902 to step S905 is repeated until a preset constraint is satisfied, such as a number of iterations or an elapsed time, the number of feature planes held in duplicate in the RAMs of the arithmetic circuits, or the rate at which exchanging feature planes reduces the amount of duplicated data. For example, the processing from step S902 to step S905 is repeated until the number of feature planes held in duplicate in the RAMs of the arithmetic circuits falls to or below a predetermined number. When the method is based on simulated annealing, its convergence condition may be used instead. This embodiment is not limited to simulated annealing; an evolutionary algorithm, of which a genetic algorithm is a typical example, may also be used.
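The loop of steps S902 to S906 can be sketched as follows for the two-circuit case. This is a simplified greedy pairwise-swap search with a fixed iteration budget, not full simulated annealing (there is no temperature schedule), and the cost is reduced to the number of duplicated reference planes. The connection table is the one read off the description of FIG. 10; the function names, iteration count, and seed are assumptions.

```python
import random

# Output-side plane -> reference-side planes it needs (as described for FIG. 10).
CONNECTIONS = {
    107: {101}, 108: {102}, 109: {103, 105},
    110: {101, 106}, 111: {102, 104}, 112: {103, 106},
}

def cost(partition):
    """Number of reference planes held by both circuits (two-circuit case)."""
    refs = [set().union(*(CONNECTIONS[o] for o in outs)) for outs in partition]
    return len(refs[0] & refs[1])

def local_search(partition, iterations=200, seed=0):
    rng = random.Random(seed)
    current = [set(p) for p in partition]
    best = cost(current)
    for _ in range(iterations):              # S906: fixed iteration budget
        a = rng.choice(sorted(current[0]))   # S902: pick one plane per circuit
        b = rng.choice(sorted(current[1]))
        current[0].remove(a)                 # S903: exchange the two planes
        current[0].add(b)
        current[1].remove(b)
        current[1].add(a)
        new = cost(current)                  # S904: evaluate
        if new < best:                       # S905: keep only improvements
            best = new
        else:                                # otherwise revert the exchange
            current[0].remove(b)
            current[0].add(a)
            current[1].remove(a)
            current[1].add(b)
    return current, best

initial = [{107, 108, 109}, {110, 111, 112}]
result, final_cost = local_search(initial)
print(cost(initial), final_cost)
```

Because an exchange is kept only when it strictly lowers the cost, the final cost can never exceed the initial one; an annealing variant would additionally accept some worsening exchanges early on to escape local minima.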

本実施形態の方法によって、既存の階層的な結合関係に応じて、各演算回路への複数の畳込みフィルタ演算の最適な分割（割当て）を行うことができる。最適な分割を行うことで、各演算回路のメモリで重複して保持するデータの量が少なくなり、データ転送やデータ読み取りの時間が短縮されて、効率よい並列処理が可能となる。また、本実施形態は、学習済みの階層的な結合関係に対して適用することが可能であるため柔軟性が高いといえる。 According to the method of the present embodiment, it is possible to optimally divide (assign) a plurality of convolution filter operations to each arithmetic circuit according to the existing hierarchical coupling relationship. By performing the optimum division, the amount of data duplicated and held in the memory of each arithmetic circuit is reduced, the time for data transfer and data reading is shortened, and efficient parallel processing becomes possible. Further, it can be said that this embodiment has high flexibility because it can be applied to a learned hierarchical connection relationship.

本実施形態ではＣＮＮ処理の場合について説明したが、本実施形態の方法は、ＲｅｓｔｒｉｃｔｅｄＢｏｌｔｚｍａｎｎＭａｃｈｉｎｅｓやＲｅｃｕｒｓｉｖｅＮｅｕｒａｌＮｅｔｗｏｒｋ等の他の階層的な処理にも適用可能である。 Although this embodiment has been described for CNN processing, the method of this embodiment is also applicable to other hierarchical processing such as Restricted Boltzmann Machines and Recursive Neural Networks.

また、本実施形態では２次元の特徴データである特徴面に対する階層的な演算処理の例について説明したが、音声データ等の１次元の特徴データや時間の変化を含めた３次元の特徴データに対するＣＮＮ処理等の階層的な演算処理に適用することも可能である。 Further, in the present embodiment, an example of hierarchical arithmetic processing for a feature surface which is two-dimensional feature data has been described, but for one-dimensional feature data such as voice data and three-dimensional feature data including time changes. It can also be applied to hierarchical arithmetic processing such as CNN processing.

（第２の実施形態） (Second Embodiment)
次に、本発明の第２の実施形態について説明する。本実施形態のハード構成は、第１の実施形態と同じであるので、その説明を省略する。 Next, a second embodiment of the present invention will be described. Since the hardware configuration of this embodiment is the same as that of the first embodiment, its description is omitted.

本実施形態では、割当てモジュール１０１２が階層間の結合関係を分割する際に、階層間の結合関係をグラフと見立て、最大フロー最小カット定理に基づいてグラフをカットすることで階層間での結合の分割を決定する。本実施形態の分割方法を図１２および図１３を用いて説明する。図１３は図８のフローチャート上の階層間分割決定ステップＳ８０３の別の実施形態を表している。本実施形態でいうグラフとは階層的な結合関係を表している。また、以下では階層的な結合関係をグラフとして扱うに当たり、特徴面をグラフの結合点として、特徴面の結合関係を入力層から出力層に向けて有効グラフとする。 In this embodiment, when the allocation module 1012 divides the connection relationship between layers, it treats that relationship as a graph and determines the division of the inter-layer connections by cutting the graph based on the max-flow min-cut theorem. The division method of this embodiment will be described with reference to FIGS. 12 and 13. FIG. 13 shows another embodiment of the inter-layer division determination step S803 in the flowchart of FIG. 8. The graph in this embodiment represents the hierarchical connection relationship. In the following, when the hierarchical connection relationship is handled as a graph, each feature plane is treated as a node of the graph, and the connections between feature planes form a directed graph oriented from the input layer toward the output layer.

ステップＳ１６０２では、分割対象である階層的な結合関係を選択する。本処理は分割統治的に繰り返しグラフを分割していくことを想定した説明となるため、すべての条件を満たすグラフになるまでグラフを繰り返し処理していく。つまり、条件を満たしていないグラフをこのステップで選択する。 In step S1602, the hierarchical connection relationship to be divided is selected. This description assumes that the graph is divided repeatedly in a divide-and-conquer manner, so the graphs are processed repeatedly until every graph satisfies all the conditions. That is, a graph that does not yet satisfy the conditions is selected in this step.

次に、ステップＳ１６０３では選択したグラフに対して、グラフカットを行うための整形を行う。本ステップを説明するにあたり、図５に示すような階層的な結合関係を例に説明する。図１２は図５の階層関係に対して、各特徴面１０１−１１２を結合点と見立て、ｓ点１５０１およびｔ点１５１４をそれぞれ送信点、受信点として追加する。また、ｓ点と全結合点を結合点方向に接続、ｔ点と結合点をｔ点方向に接続してある。ここで矢印１１３−１２２は参照側の特徴面の階層１２３および出力側の特徴面の階層１２４間での結合関係を表しているが、それぞれの矢印に対して参照側の特徴面のサイズおよびカーネルフィルタの係数の合計値を重みとして与える。ここで矢印の重みを特徴面のサイズおよびカーネルフィルタの係数の合計値として説明したが、これに限るものではない。矢印１５０２−１５１３および矢印１５１５−１５２６に対しては、矢印１１３−１２２の重みに対して十分大きな値とする。 Next, in step S1603, the selected graph is shaped so that a graph cut can be performed. This step is explained using the hierarchical connection relationship shown in FIG. 5 as an example. In FIG. 12, each of the feature planes 101 to 112 in the hierarchy of FIG. 5 is treated as a node, and an s point 1501 and a t point 1514 are added as the source and the sink, respectively. The s point is connected to every node in the direction of the node, and every node is connected to the t point in the direction of the t point. The arrows 113 to 122 represent the connections between the layer 123 of reference-side feature planes and the layer 124 of output-side feature planes, and each of these arrows is given, as its weight, the total of the size of the reference-side feature plane and the kernel filter coefficients. The weight of an arrow is described here as that total, but it is not limited to this. The arrows 1502 to 1513 and the arrows 1515 to 1526 are given values sufficiently larger than the weights of the arrows 113 to 122.
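The graph shaping of step S1603 can be sketched as the construction of a weighted, directed edge table. This covers only the shaping step, not the cut itself. The connection table is an assumption read off the description of FIG. 10, and the plane size, kernel size, and `BIG` terminal weight are hypothetical values; the patent only requires the terminal weights to be sufficiently larger than the connection weights.

```python
# Output-side plane -> reference-side planes it needs (as described for FIG. 10).
CONNECTIONS = {
    107: {101}, 108: {102}, 109: {103, 105},
    110: {101, 106}, 111: {102, 104}, 112: {103, 106},
}
PLANE_SIZE = 100  # hypothetical size of one reference-side feature plane
KERNEL_SIZE = 9   # hypothetical number of coefficients in one kernel
BIG = 10**9       # hypothetical "sufficiently large" terminal weight

def build_graph(connections):
    """Return (nodes, edges) where edges maps (from, to) -> weight."""
    nodes = set(connections)
    for refs in connections.values():
        nodes |= refs
    edges = {}
    # Connection arrows: reference plane -> output plane, weighted by the
    # reference plane size plus the kernel coefficient count.
    for out, refs in connections.items():
        for ref in refs:
            edges[(ref, out)] = PLANE_SIZE + KERNEL_SIZE
    # Terminal arrows: s -> every node and every node -> t, with large weights.
    for node in nodes:
        edges[("s", node)] = BIG
        edges[(node, "t")] = BIG
    return nodes, edges

nodes, edges = build_graph(CONNECTIONS)
internal = [e for e in edges if "s" not in e and "t" not in e]
print(len(nodes), len(internal), len(edges) - len(internal))  # 12 10 24
```

The counts match the figure description: 12 feature-plane nodes, 10 connection arrows (113 to 122), and 24 terminal arrows (1502 to 1513 and 1515 to 1526).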

ステップＳ１６０４では、前ステップで作成した階層間の関係をグラフ化したものに対して、最大フロー最小カットのような、古典的なグラフカット手法を用いてグラフを切断する。グラフカットによって、十分に大きい重みの矢印１５０２−１５１３や矢印１５１５−１５２６間は切断されずに矢印１１３−１２２間で、最も重みが小さい結合関係で入力層及び出力層が切断されることとなる。 In step S1604, the graph is cut by using a classical graph cut method such as the maximum flow minimum cut for the graphed relationship between the layers created in the previous step. The graph cut does not cut between arrows 1502-1513 and arrows 1515-1526 with sufficiently large weights, but cuts the input layer and output layer between arrows 113-122 with the least weighted coupling relationship. ..

ステップＳ１６０５では、カットしたグラフが分割条件を満たすかを確認し、条件を満たすまで、ステップＳ１６０２−Ｓ１６０４を繰り返し実行する。以下で説明している分割条件は図８のステップＳ８０１で取得することが可能である。ここで分割条件とは、実施形態１と同様、カットされた全ての分割内で処理に必要なデータ量の所定値などである。通常、処理に必要なデータは、演算回路のＲＡＭで保持するので、ＲＡＭで保持するデータのデータ量が、所定値を超えていないかどうかを確認し、所定値に収まらない場合は該グラフに対して再度カットする処理を行う。また、カットされたグラフの個数を規定する場合は、カット済みのグラフをランダムに複数選択、マージし、マージしたグラフに対して再度グラフカットを行う。これを繰り返すことで、規定数になるように繰り返し行うことも可能である。本実施形態の処理によって、第１の実施形態と同様に図６に示す分割結果が得られることができる。本実施形態の分割方法は、各演算回路のＲＡＭで重複して保持するデータのデータ量を最も小さくすることができる。また、演算回路のＲＡＭで保持するデータのデータ量が所定値を超えないようにすることができる。 In step S1605, it is confirmed whether the cut graph satisfies the division condition, and steps S1602-S1604 are repeatedly executed until the condition is satisfied. The division conditions described below can be obtained in step S801 of FIG. Here, the division condition is a predetermined value of the amount of data required for processing in all the cut divisions, as in the first embodiment. Normally, the data required for processing is held in the RAM of the arithmetic circuit. Therefore, check whether the amount of data held in the RAM exceeds the predetermined value, and if it does not fit in the predetermined value, display the graph. On the other hand, the process of cutting again is performed. When specifying the number of cut graphs, a plurality of cut graphs are randomly selected and merged, and the merged graph is cut again. By repeating this, it is possible to repeat the process so as to reach the specified number. By the processing of the present embodiment, the division result shown in FIG. 6 can be obtained as in the first embodiment. The division method of the present embodiment can minimize the amount of data to be duplicated and held in the RAM of each arithmetic circuit. Further, the amount of data held in the RAM of the arithmetic circuit can be prevented from exceeding a predetermined value.

(Third Embodiment)
Next, a third embodiment of the present invention will be described. This embodiment describes a method in which the hierarchical connection relationship is divided dynamically within the device, based on a predetermined condition, before the convolution filter operation is performed. The hardware configuration of this embodiment is the same as that of the first embodiment, so its description is omitted. In this embodiment, the allocation module 1012 checks the operating state of the arithmetic circuits 1002-1 to 1002-n at a predetermined timing.

The allocation module 1012 dynamically divides the connection relationship according to the operating state of each of the arithmetic circuits 1002-1 to 1002-n. The allocation module 1012 outputs the result of dividing the connection relationship. Based on that result, the CPU 1007 transfers the reference-side feature planes and filter kernels to the RAM of each of the arithmetic circuits 1002-1 to 1002-n as needed.

For example, suppose the operating state of the arithmetic circuits is checked at some timing and M of the N arithmetic circuits have finished processing. Unprocessed CNN operations can then be assigned to those M arithmetic circuits. As a result, even when the operating state of the arithmetic circuits fluctuates, parallel processing can be performed efficiently by re-dividing the work appropriately each time.
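The reassignment described above can be sketched as follows. The sketch assumes the operating state is available as a mapping from circuit index to a busy flag, and it distributes pending work round-robin. The round-robin rule is a placeholder assumption; in the embodiment the allocation module would re-divide the connection relationship instead. The function and variable names are hypothetical.

```python
def assign_pending(circuit_busy, pending):
    """Assign unprocessed operation groups to the idle arithmetic circuits.

    circuit_busy: dict mapping circuit index -> True if still processing.
    pending: list of unprocessed operation groups.
    Returns a dict mapping each idle circuit index -> list of groups.
    """
    idle = [c for c, busy in circuit_busy.items() if not busy]
    assignment = {c: [] for c in idle}
    if not idle:
        return assignment  # nothing can be assigned at this timing
    for i, group in enumerate(pending):
        assignment[idle[i % len(idle)]].append(group)
    return assignment


# N = 4 circuits, of which M = 2 (circuits 1 and 2) have finished.
state = {0: True, 1: False, 2: False, 3: True}
plan = assign_pending(state, ["g1", "g2", "g3"])
```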

(Fourth Embodiment)
Next, a fourth embodiment of the present invention will be described. This embodiment is also effective when the hierarchical connection relationship changes dynamically. FIG. 14 shows the hierarchical connection relationship changing dynamically.

Normally, CNN processing is performed on the hierarchical connections shown in FIG. 4. However, the structure of the connection relationships indicated by broken lines 501, 502, and 503 in FIG. 14, the convolution filter coefficients, and the feature planes may change dynamically. Such changes arise from internal factors, such as imaging conditions and intermediate results of feature extraction, or from physical factors, such as computing resources and power consumption in the device. This embodiment can adapt to these changes as well.

As an example of a change caused by imaging conditions or intermediate feature-extraction results, suppose an intermediate result of the recognition process shows that the probability that the identification target is a person exceeds a predetermined value. If the portions indicated by broken lines 501 to 503 mainly compute features related to animals, those portions are not processed.

In addition, depending on priorities such as computing resources and power consumption in the device, the portions indicated by broken lines 501 to 503 may be skipped, provided that doing so does not significantly affect the recognition result.
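The first of the two skipping rules above can be sketched as a simple filter over branches of the network. The data layout (a dict with `optional` and `main_feature` fields), the branch names, and the threshold value are assumptions made for illustration only.

```python
def active_branches(branches, person_probability, threshold):
    """Return the names of branches that should still be processed.

    branches: dict mapping branch name -> {"optional": bool,
              "main_feature": str}. Hypothetical description format.
    A branch is skipped when it is optional, mainly computes animal
    features, and the target is likely a person.
    """
    likely_person = person_probability > threshold
    return [
        name
        for name, info in branches.items()
        if not (likely_person
                and info["optional"]
                and info["main_feature"] == "animal")
    ]


branches = {
    "common": {"optional": False, "main_feature": "generic"},
    "branch_501": {"optional": True, "main_feature": "animal"},
}
kept = active_branches(branches, person_probability=0.9, threshold=0.8)
```

The resulting list describes the changed connection relationship that would then be divided again among the arithmetic circuits.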

In this case, the CPU 1007 notifies the allocation module 1012 of the changed hierarchical connection relationship, and the allocation module 1012 divides the changed connection relationship. The CPU 1007 then supplies data to the RAM of each arithmetic circuit based on the connection relationship dynamically divided by the allocation module 1012, so that parallel processing is executed dynamically.

(Fifth Embodiment)
Next, a fifth embodiment of the present invention will be described. The embodiments described so far concern embedded devices. This embodiment describes an implementation in a cloud server system. FIG. 15 shows the cloud server system.

The control PC 1701 controls all of the CNN processing PCs 1705 to 1713 via the communication networks 1702 to 1704. Control here means allocating the reference-side feature plane data and weight kernels to the CNN processing PCs 1705 to 1713 so that parallel processing can be performed based on the layer division information, and controlling distributed data processing based on that allocation.

Specifically, the control PC 1701 divides the hierarchical structure by the method described with reference to FIGS. 7 and 8, and determines the allocation of feature planes and other data to each CNN processing PC. As in the first embodiment, the division of the hierarchical structure is determined so as to reduce the data that the CNN processing PCs must hold in duplicate. The amount of data supplied to the CNN processing PCs 1705-1713 is therefore small, and the parallel operations are performed efficiently.

Based on the division information determined by the control PC 1701, each of the CNN processing PCs 1705-1713 performs its convolution operations and outputs the results to the control PC. In this way, parallel processing can be carried out efficiently not only within an embedded device but also in a cloud server system.
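The flow between the control PC and the CNN processing PCs can be sketched in a single process, with threads standing in for the networked PCs. This is an illustration only: the real system communicates over a network, and the task format, function names, and the simplification that each output refers to exactly one input plane are all assumptions. The filter operation is written as a product-sum without kernel flipping, as is common in CNNs.

```python
from concurrent.futures import ThreadPoolExecutor


def conv2d_valid(image, kernel):
    """Product-sum filter over every position where the kernel fits."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def processing_pc(task):
    """Stand-in for one CNN processing PC.

    task: (planes, jobs) where planes maps plane name -> 2D list and
    jobs is a list of (output_name, input_plane_name, kernel).
    """
    planes, jobs = task
    return {out: conv2d_valid(planes[src], k) for out, src, k in jobs}


def control_pc(tasks):
    """Stand-in for the control PC: dispatch tasks and collect results."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        for partial in pool.map(processing_pc, tasks):
            results.update(partial)
    return results


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
tasks = [
    ({"f1": image}, [("o1", "f1", [[1, 0], [0, 1]])]),
    ({"f1": image}, [("o2", "f1", [[1, 1], [1, 1]])]),
]
outputs = control_pc(tasks)
```

Here plane `f1` is sent to both workers, which is exactly the kind of duplicated supply that the division method tries to keep small.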

(Other Embodiments)
The present invention can also be realized by supplying a program that implements one or more functions of the above-described embodiments to a system or device via a network or a storage medium, and having one or more processors in a computer of that system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more of the functions.

1000 Image input module
1001-1 to 1001-n RAM
1002-1 to 1002-n Arithmetic circuit
1012 Allocation module
1007 CPU
1008 ROM
1009 RAM

Claims

An arithmetic processing unit comprising:
a plurality of arithmetic circuits that perform hierarchical arithmetic processing in parallel; and
allocation means for allocating, to the plurality of arithmetic circuits, each of a plurality of operations in a predetermined layer of the arithmetic processing, each operation referring to at least a part of a plurality of operation results in a layer preceding the predetermined layer,
wherein the allocation means
calculates, for an allocation candidate, an evaluation value corresponding to the amount of data that overlaps among the plurality of arithmetic circuits, the data being the operation results in the layer preceding the predetermined layer that are referred to during the operations, and
determines, as the allocation of the plurality of operations to the plurality of arithmetic circuits, a second allocation candidate whose evaluation value satisfies a predetermined criterion and for which the duplication, among the plurality of arithmetic circuits, of the operation results in the layer preceding the predetermined layer that are referred to during the operations is smaller than at least that of a first allocation candidate.

The arithmetic processing unit according to claim 1, wherein each of the plurality of arithmetic circuits has a storage means for storing the operation results in the layer preceding the predetermined layer that are referred to when performing the operations.

The arithmetic processing unit according to claim 1 or 2, wherein the arithmetic processing is an arithmetic processing by a convolutional neural network.

The arithmetic processing unit according to any one of claims 1 to 3, wherein the allocation means allocates each of the plurality of operations to the plurality of arithmetic circuits based on a simulated annealing method.

The arithmetic processing unit according to any one of claims 1 to 4, wherein the allocation means allocates each of the plurality of operations to the plurality of arithmetic circuits so that the difference in the amount of operations among the plurality of arithmetic circuits is small.

The arithmetic processing unit according to claim 2, wherein the allocation means allocates each of the plurality of operations to the plurality of arithmetic circuits so that the amount of data stored in the storage means does not exceed a predetermined value.

The arithmetic processing unit according to any one of claims 1 to 6, wherein the allocation means allocates each of the plurality of operations to the plurality of arithmetic circuits so that the duplication, among the plurality of arithmetic circuits, of the operation results in the layer preceding the predetermined layer that are referred to during the operations is less than a predetermined number.

The arithmetic processing unit according to any one of claims 1 to 7, further comprising a first confirmation means for confirming the operating state of the plurality of arithmetic circuits,
wherein the allocation means changes the allocation of the plurality of operations to the plurality of arithmetic circuits based on the operating state confirmed by the first confirmation means.

The arithmetic processing unit according to any one of claims 1 to 8, wherein the arithmetic circuit performs a convolution operation between an arithmetic result in a layer prior to the predetermined layer and a filter kernel.

The arithmetic processing unit according to any one of claims 1 to 9, wherein the allocation means determines the second allocation candidate as the allocation of the plurality of operations to the plurality of arithmetic circuits when the evaluation value of the second allocation candidate indicates that the amount of data held in duplicate among the plurality of arithmetic circuits is equal to or less than a predetermined number, or when the evaluation value of the second allocation candidate indicates that the reduction rate of the amount of data held in duplicate among the plurality of arithmetic circuits in the second allocation candidate, relative to the amount of data held in duplicate among the plurality of arithmetic circuits in the first allocation candidate, is equal to or greater than a predetermined value.

The arithmetic processing unit according to any one of claims 1 to 10, wherein the evaluation value includes a penalty value according to the number of the plurality of arithmetic circuits and the amount of arithmetic processing that can be assigned to the plurality of arithmetic circuits.

An arithmetic processing method for allocating, to a plurality of arithmetic circuits that perform hierarchical arithmetic processing in parallel, each of a plurality of operations in a predetermined layer of the arithmetic processing, each operation referring to at least a part of a plurality of operation results in a layer preceding the predetermined layer, the method comprising:
calculating, for an allocation candidate, an evaluation value corresponding to the amount of data that overlaps among the plurality of arithmetic circuits, the data being the operation results in the layer preceding the predetermined layer that are referred to during the operations; and
determining, as the allocation of the plurality of operations to the plurality of arithmetic circuits, a second allocation candidate whose evaluation value satisfies a predetermined criterion and for which the duplication, among the plurality of arithmetic circuits, of the operation results in the layer preceding the predetermined layer that are referred to during the operations is smaller than at least that of a first allocation candidate.

A program for operating a computer as the arithmetic processing unit according to any one of claims 1 to 11.