JP7467786B2

JP7467786B2 - DATA PROCESSING APPARATUS AND DATA PROCESSING METHOD

Info

Publication number: JP7467786B2
Application number: JP2020043230A
Authority: JP
Inventors: 洋一富岡; セドゥーキンスタニスラフ
Original assignee: University of Aizu
Current assignee: University of Aizu
Priority date: 2020-03-12
Filing date: 2020-03-12
Publication date: 2024-04-16
Anticipated expiration: 2040-03-12
Also published as: JP2021144519A

Description

Article 30, paragraph 2 of the Patent Act applies. (1) Published in the proceedings of the 8th International Congress on Advanced Applied Informatics on July 7, 2019. (2) Presented at the 8th International Congress on Advanced Applied Informatics on July 7, 2019. (3) Presented at the IoT Workshop "Trends in Innovative IoT Business through Sensing Edge: FY2019 1st Meeting on Progress and Use Cases of IoT Devices and Systems for Industry and Infrastructure" on November 27, 2019. (4) Published at https://ieeexplore.ieee.org/document/8992640 on February 13, 2020.

The present invention relates to a data processing device and a data processing method, and in particular to a data processing device and a data processing method suited to convolution operations in a convolutional neural network.

In recent years, convolutional neural networks (CNNs), which add convolution to neural networks, have become widely recognized as an effective machine learning technique for image recognition and similar tasks. An overview of CNNs follows.

FIG. 1 is a diagram outlining the system configuration of a CNN. In the CNN shown in FIG. 1, layer L1 and layer L2 each include a convolutional layer and a pooling layer.

The convolutional layer performs an operation that multiplies the input data by the feature weights of filters (kernels), that is, an operation that convolves features. Specifically, when the input data is image data, the convolutional layer multiplies the input data by the feature weights of each of several different filters, producing as many sets of image data as there are filters. In other words, using multiple filters makes it possible to detect patterns in the input data (image data) that capture its various features.

The pooling layer is placed immediately after the convolutional layer. By shrinking the layer, the pooling layer makes subsequent processing easier to perform and reduces the positional sensitivity of the features extracted in the convolutional layer.

The CNN then recognizes the input data (image data) by means of a fully connected multilayer perceptron arranged in layers L3 to L5 (see, for example, Patent Document 1).

Patent Document 1: JP 2019-003414 A

The convolution operations in such a CNN (especially those performed in the early layers) involve an enormous number of multiply-accumulate operations. The multiply-accumulate operations performed in a convolution operation are described below.

FIG. 2 is a diagram explaining the multiply-accumulate operations performed in a convolution operation. In the example shown in FIG. 2(A), the input data (input feature map) has a data size of Nix × Niy and Nif channels. The filters have a data size of Nkx × Nky. The output data (output feature map) has a data size of Nox × Noy and Nof channels.

In the convolution operation of this example, as shown in FIG. 2(B), the multiply-accumulate operation "pixel_L(no; x, y) += pixel_L-1(ni; x+kx, y+ky) * weight_L-1(ni, no; kx, ky)" is performed repeatedly. In the example shown in FIG. 2(B), when there is no "bias(no)", the initial value of "pixel_L(no; x, y)" is 0.

The number of multiply-accumulate operations executed in the example shown in FIG. 2(B) is Nof × Nox × Noy × Nif × Nkx × Nky. Depending on the magnitude of Nof and the other dimensions, the number of multiply-accumulate operations that must be executed therefore becomes enormous.
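The loop nest of FIG. 2(B) and the operation count above can be illustrated with a minimal sketch. The function name `direct_convolution`, the nested-list data layout, and the choice of stride 1 without padding are assumptions made for illustration; they are not taken from the patent.

```python
def direct_convolution(inp, weights, bias=None):
    """Direct convolution as a six-deep loop nest.

    inp[ni][y][x] is the input feature map, weights[no][ni][ky][kx] the filters.
    Returns (out[no][y][x], number of multiply-accumulate operations).
    Assumption: stride 1 and no padding, so Nox = Nix - Nkx + 1 (same for y).
    """
    nif = len(inp)
    niy, nix = len(inp[0]), len(inp[0][0])
    nof = len(weights)
    nky, nkx = len(weights[0][0]), len(weights[0][0][0])
    noy, nox = niy - nky + 1, nix - nkx + 1

    # The initial value is bias(no) if a bias is given, otherwise 0.
    out = [[[bias[no] if bias else 0 for _ in range(nox)]
            for _ in range(noy)] for no in range(nof)]
    macs = 0
    for no in range(nof):
        for y in range(noy):
            for x in range(nox):
                for ni in range(nif):
                    for ky in range(nky):
                        for kx in range(nkx):
                            out[no][y][x] += (inp[ni][y + ky][x + kx]
                                              * weights[no][ni][ky][kx])
                            macs += 1
    return out, macs


# Example: Nif = 2 channels of 4 x 4 ones, Nof = 3 filters of 3 x 3 ones.
example_input = [[[1] * 4 for _ in range(4)] for _ in range(2)]
example_weights = [[[[1] * 3 for _ in range(3)] for _ in range(2)] for _ in range(3)]
example_out, example_macs = direct_convolution(example_input, example_weights)
```

In this example the counter equals Nof × Nox × Noy × Nif × Nkx × Nky = 3 × 2 × 2 × 2 × 3 × 3, which is the product given in the text.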

Therefore, to make better use of CNNs in fields such as robot motion control and autonomous driving, the convolution operations (multiply-accumulate operations) performed in the convolutional layer must be executed in parallel as far as possible so as to shorten the processing time of the CNN.

An object of the present invention is therefore to provide a data processing device and a data processing method that enable convolution operations to be performed at high speed.

To achieve the above object, a data processing device of the present invention calculates feature quantities of two-dimensional input data using a plurality of filters corresponding to the two-dimensional input data. The device has a processing element group consisting of a plurality of processing elements arranged two-dimensionally, and a first storage unit having a three-dimensional storage area with an entry for each of the processing elements and for each of the filters. The processing element group transmits each item of data included in the two-dimensional input data to one of the processing elements such that every item is sent to a different processing element; acquires, for each of the filters, one feature weight included in that filter; generates three-dimensional feature weights by duplicating the one feature weight acquired for each filter as many times as there are processing elements and arranging the copies in the same layout as the processing elements; transmits the one feature weight acquired for each filter to each of the processing elements; causes each of the processing elements to multiply the transmitted item of input data by the transmitted feature weight of each filter, thereby calculating a three-dimensional product; adds the three-dimensional product to the values stored in the three-dimensional storage area of the first storage unit; causes each of the processing elements to transmit the item of input data it holds to an adjacent processing element, so that each processing element receives data not yet used in calculating the three-dimensional product; and repeats the acquiring, the generating, the transmitting of the feature weights, the calculating, the adding, and the transmitting to the adjacent processing element until the number of times the adding has been performed reaches the number of feature weights included in each of the filters.

In the data processing device of the present invention for achieving the above object, the processing element group performs the processing of a convolutional layer of a convolutional neural network.

In the data processing device of the present invention for achieving the above object, the processing element group outputs the three-dimensional values stored in the first storage unit after the repeated processing.

The data processing device of the present invention for achieving the above object has a second storage unit having a three-dimensional storage area with an entry for each of the processing elements and for each of the filters, and the processing element group stores the generated three-dimensional feature weights in the second storage unit.

The data processing device of the present invention for achieving the above object has a third storage unit having a two-dimensional storage area with an entry for each of the processing elements, and the processing element group transmits each item of data included in the two-dimensional input data stored in the third storage unit to one of the processing elements such that every item is sent to a different processing element.

The data processing device of the present invention for achieving the above object has a plurality of the processing element groups, each of which performs processing based on different two-dimensional input data and a different plurality of filters, and a fourth storage unit having a three-dimensional storage area with an entry for each of the processing elements and for each of the filters. After the repeated processing, the plurality of processing element groups calculate a three-dimensional total by adding together the three-dimensional values stored in the first storage unit of each processing element group, and store the calculated three-dimensional total in the fourth storage unit.

In the data processing device of the present invention for achieving the above object, the plurality of processing element groups output the three-dimensional total stored in the fourth storage unit after calculating the three-dimensional total.

A data processing method of the present invention for achieving the above object is performed in a data processing device that calculates feature quantities of two-dimensional input data using a plurality of filters corresponding to the two-dimensional input data. The device has a processing element group consisting of a plurality of processing elements arranged two-dimensionally, and a first storage unit having a three-dimensional storage area with an entry for each of the processing elements and for each of the filters. In the method, the processing element group transmits each item of data included in the two-dimensional input data to one of the processing elements such that every item is sent to a different processing element; acquires, for each of the filters, one feature weight included in that filter; generates three-dimensional feature weights by duplicating the one feature weight acquired for each filter as many times as there are processing elements and arranging the copies in the same layout as the processing elements; transmits the one feature weight acquired for each filter to each of the processing elements; causes each of the processing elements to multiply the transmitted item of input data by the transmitted feature weight of each filter, thereby calculating a three-dimensional product; adds the three-dimensional product to the values stored in the three-dimensional storage area of the first storage unit; causes each of the processing elements to transmit the item of input data it holds to an adjacent processing element, so that each processing element receives data not yet used in calculating the three-dimensional product; and repeats the acquiring, the generating, the transmitting of the feature weights, the calculating, the adding, and the transmitting to the adjacent processing element until the number of times the adding has been performed reaches the number of feature weights included in each of the filters.

With the data processing device and data processing method of the present invention, the convolution operations performed in the convolutional layer can be executed in parallel, so the convolution operations can be performed at high speed.

FIG. 1 is a diagram outlining the system configuration of a CNN.
FIG. 2 is a diagram explaining the multiply-accumulate operations performed in a convolution operation.
FIG. 3 is a diagram showing an example configuration of the data processing device 10 according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example configuration of TPEn according to an embodiment of the present invention.
FIG. 5 is a flowchart showing details of the processing in TPEn.
FIG. 6 is a flowchart showing details of the processing in TPEn.
FIG. 7 is a flowchart showing details of the processing in TPEn.
FIG. 8 is a flowchart showing details of the processing in TPEn.
FIG. 9 is a diagram explaining a specific example of the processing from S16 to S22.
FIG. 10 is a diagram explaining a specific example of the processing from S16 to S22.
FIG. 11 is a diagram explaining a specific example of the processing from S16 to S22.
FIG. 12 is a diagram explaining a specific example of the processing from S16 to S22.
FIG. 13 is a diagram explaining a specific example of the processing from S16 to S22.
FIG. 14 is a diagram explaining a specific example of the processing from S16 to S22.
FIG. 15 is a diagram explaining a specific example of the processing from S16 to S22.
FIG. 16 is a diagram explaining the degree of parallelism of the multiply-accumulate operations performed in a convolution operation.

An embodiment of the present invention is described below with reference to the drawings. The embodiment does not, however, limit the technical scope of the present invention.

[Configuration of the data processing device]
First, the configuration of the data processing device 10 is described. FIG. 3 is a diagram showing an example configuration of the data processing device 10 according to the embodiment of the present invention.

In the example shown in FIG. 3, the data processing device 10 has a board 12 and a CPU (Central Processing Unit) 13. The board 12 has a chip 14 and an off-chip memory 15 (hereinafter also simply called memory 15). In the data processing device 10, for example, the CPU 13 and the chip 14 divide the various processes between them.

The board 12 may be, for example, an FPGA (Field-Programmable Gate Array) board. The chip 14 may be, for example, an FPGA chip or an ASIC (Application Specific Integrated Circuit) chip.

In the example shown in FIG. 3, the chip 14 further has a controller 17, an on-chip memory 16 (hereinafter also simply called memory 16), and an STP (Soft Tensor Processor) 20, which is a processor containing m TPEs (Tensor Processing Elements), TPE1 to TPEm. Hereinafter, TPE1 to TPEm are also collectively called TPEs.

In the example shown in FIG. 3, the memory 16 includes a buffer 21, which is a storage area holding the three-dimensional input data X that is the target of the convolution operation; a buffer 22, which is a storage area holding the filters used in the convolution operation; and a buffer 23 (hereinafter also called the fourth storage unit), which is a storage area holding the results of the convolution operation.

In the example shown in FIG. 3, the controller 17 causes each TPE to perform the convolution operation by, for example, transmitting the various information stored in the buffer 21, the buffer 22, and the buffer 23 to each TPE.

[Configuration of the TPE]
Next, the configuration of the TPE is described. FIG. 4 is a diagram showing an example configuration of TPEn in this embodiment. The TPEs other than TPEn have the same configuration as TPEn, so their description is omitted.

In the example shown in FIG. 4, TPEn has Nox × Noy arithmetic units U, each of which performs the convolution operation (multiply-accumulate operation); registers R1,1 to R1,Nof (hereinafter also collectively called register R1 or the first storage unit), which store the results of the convolution operation in the arithmetic units U; registers R2,1 to R2,Nof (hereinafter also collectively called register R2 or the second storage unit), which store part of the filters held in the buffer 22; and a register R3 (hereinafter also called the third storage unit), which stores part of the input data X held in the buffer 21.

Specifically, as shown in FIG. 4, the controller 17 acquires the first feature weight ω_1^(1) included in the first of the Nof filters stored in the buffer 22, generates a two-dimensional feature weight ω_1^(1) by duplicating the acquired feature weight ω_1^(1) Nox × Noy times (as many copies as there are arithmetic units U) and arranging the copies, and stores the generated two-dimensional feature weight ω_1^(1) in the register R2,1.

Likewise, as shown in FIG. 4, the controller 17 acquires the first feature weight ω_2^(1) included in the second of the Nof filters stored in the buffer 22, generates a two-dimensional feature weight ω_2^(1) by duplicating the acquired feature weight ω_2^(1) Nox × Noy times and arranging the copies, and stores the generated two-dimensional feature weight ω_2^(1) in the register R2,2.

The controller 17 handles the first feature weight of each of the remaining filters in the same way, generating a two-dimensional feature weight and storing it in the register R2.

That is, the controller 17 generates one two-dimensional feature weight for each of the Nof filters (hereinafter also collectively called the three-dimensional feature weights Ω) and stores them in the register R2.

As shown in FIG. 4, the controller 17 also stores in the register R3 the two-dimensional input data X corresponding to TPEn, taken from the three-dimensional input data X stored in the buffer 21.

Next, the Nox × Noy arithmetic units U decompose the three-dimensional feature weights Ω stored in the register R2 into Nof two-dimensional feature weights. The Nox × Noy arithmetic units U then multiply the two-dimensional input data X stored in the register R3 by each of the Nof two-dimensional feature weights obtained by the decomposition, thereby calculating Nof two-dimensional products (hereinafter also collectively called the three-dimensional product).

That is, each of the Nox × Noy arithmetic units U multiplies the item of the two-dimensional input data X in the register R3 that corresponds to that arithmetic unit by the feature weight corresponding to that arithmetic unit in each of the Nof two-dimensional feature weights obtained by the decomposition, and so calculates 1 × 1 × Nof products. As a whole, the Nox × Noy arithmetic units U therefore calculate a three-dimensional product of Nox × Noy × Nof values.

The Nox × Noy arithmetic units U then add the calculated three-dimensional product to the three-dimensional data Z stored in the register R1 (the running sum of the three-dimensional products calculated so far).
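One such multiply-accumulate step can be sketched as follows. The helper names `replicate_weights` and `mac_step`, and the use of nested lists to stand in for registers R1, R2, and R3, are illustrative assumptions. In hardware the Nox × Noy arithmetic units would work concurrently, whereas this sketch simply iterates over them.

```python
def replicate_weights(w_per_filter, noy, nox):
    """Build Omega[f][y][x]: copy each filter's single weight to every unit position."""
    return [[[w for _ in range(nox)] for _ in range(noy)] for w in w_per_filter]


def mac_step(z, omega, x):
    """Update Z in place: Z[f][y][x] += X[y][x] * Omega[f][y][x]."""
    for f, plane in enumerate(omega):          # one 2-D weight plane per filter
        for yy, row in enumerate(plane):
            for xx, w in enumerate(row):       # position (yy, xx) models one unit U
                z[f][yy][xx] += x[yy][xx] * w
    return z


# Example: a 2 x 2 array of units, Nof = 2 filters whose current weights are 10 and -1.
x_r3 = [[1, 2], [3, 4]]                                     # contents of register R3
omega_r2 = replicate_weights([10, -1], 2, 2)                # contents of register R2
z_r1 = [[[0, 0], [0, 0]] for _ in range(2)]                 # register R1, initially zero
mac_step(z_r1, omega_r2, x_r3)
```

Because every position of a given plane of Ω holds the same value, storing Ω in full is equivalent to broadcasting one scalar per filter to all units.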

The Nox × Noy arithmetic units U then repeat the above processing as many times as there are feature weights in each filter stored in the buffer 22 (K^2 times). The processing in TPEn is described in detail below.

[Flowchart of the detailed processing in the TPE]
FIGS. 5 to 8 are flowcharts showing details of the processing in TPEn.

As shown in FIG. 5, the controller 17 sets i, a variable that identifies each item of two-dimensional input data X included in the three-dimensional input data X, to n (S01). That is, when TPEn is TPE1, the controller 17 sets i to 1. The controller 17 also sets k, a variable that identifies the feature weight within each filter, to 1 (S02).

If n is 1 (YES in S03), the controller 17 acquires the three-dimensional data corresponding to the bias from the buffer 22 and stores it in the register R1 (S04).

If n is not 1 (NO in S03), the controller 17 stores in the register R1 three-dimensional data Z whose elements are all 0 and whose data size is Nox × Noy × Nof (S05). In other words, the multiply-accumulate operation adds the bias only when n is 1, that is, only in TPE1.
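The branch in S03 to S05 can be sketched as below. The function name `init_accumulator` is hypothetical, and the sketch assumes that the "three-dimensional data corresponding to the bias" is one bias value per filter copied to every position; the text does not spell out that layout.

```python
def init_accumulator(n, bias, noy, nox):
    """Return the initial Z[f][y][x] for TPE number n (1-based).

    Assumption: bias holds one value per filter, replicated over all positions.
    Only TPE1 starts from the bias, so the bias is added exactly once when the
    partial results of all TPEs are later summed.
    """
    nof = len(bias)
    if n == 1:                                               # S03: YES -> S04
        return [[[bias[f] for _ in range(nox)] for _ in range(noy)]
                for f in range(nof)]
    return [[[0 for _ in range(nox)] for _ in range(noy)]   # S03: NO -> S05
            for _ in range(nof)]


z_tpe1 = init_accumulator(1, [0.5, 2.0], noy=2, nox=3)
z_tpe2 = init_accumulator(2, [0.5, 2.0], noy=2, nox=3)
```

Starting every other TPE from zero keeps the final sum across TPEs equal to the bias plus the contributions of all input channels.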

The controller 17 then acquires, from the three-dimensional input data X stored in the buffer 21, the two-dimensional data X corresponding to the i-th channel and stores it in the register R3 (S06).

Next, as shown in FIG. 6, the controller 17 acquires, from the feature weights included in the i-th channel of each filter stored in the buffer 22, the feature weights corresponding to k, which have a data size of 1 × 1 × Nof (S11).

The controller 17 then duplicates each feature weight acquired in S11 Nox × Noy times to generate the three-dimensional feature weights Ω of size Nox × Noy × Nof, and stores them in the register R2 (S12).

The Nox × Noy arithmetic units U then acquire the three-dimensional data Z stored in the register R1 (S13), the three-dimensional data Ω stored in the register R2 (S14), and the two-dimensional data X stored in the register R3 (S15).

Next, the Nox × Noy arithmetic units U decompose the Nox × Noy × Nof three-dimensional data Ω acquired in S14 into Nof items of two-dimensional data (each of data size Nox × Noy), and add the three-dimensional data Z acquired in S13 to the three-dimensional product of each decomposed item of two-dimensional data and the two-dimensional data X acquired in S15 (S16).

That is, each arithmetic unit U adds the corresponding element of the three-dimensional data Z to the product of a 1 × 1 element of the two-dimensional data decomposed from the three-dimensional data Ω and a 1 × 1 element of the data X.

Then, as shown in FIG. 7, the controller 17 updates the three-dimensional data Z stored in the register R1 to the three-dimensional data calculated in S16 (S21).

Next, the controller 17 shifts the two-dimensional data X stored in the register R3 in a direction given by a predetermined method (S22). Specifically, the controller 17 determines the shift direction of the two-dimensional data X so that each item of data in the two-dimensional data X visits every feature weight included in each filter. A specific example of the processing in S22 is described later.
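The idea of alternating a multiply-accumulate with a one-step shift to a neighbor can be sketched for a single filter as follows. The shift schedule and the edge handling here are assumptions made only for illustration: the sketch visits the filter positions in a snake (boustrophedon) order and wraps data around at the array edges, whereas the actual shift directions used by the device are those given in the specific example and its figures. The names `snake_offsets`, `shift_from_neighbor`, and `convolve_by_shifting` are hypothetical.

```python
def snake_offsets(k):
    """Order the k x k filter positions so that consecutive positions are adjacent."""
    order = []
    for ky in range(k):
        cols = range(k) if ky % 2 == 0 else range(k - 1, -1, -1)
        order.extend((ky, kx) for kx in cols)
    return order


def shift_from_neighbor(held, dy, dx):
    """Each unit (y, x) takes over the value held by unit (y + dy, x + dx).

    Assumption: wrap-around at the edges, so no data is lost while shifting.
    """
    h, w = len(held), len(held[0])
    return [[held[(yy + dy) % h][(xx + dx) % w] for xx in range(w)]
            for yy in range(h)]


def convolve_by_shifting(x, filt):
    """Accumulate Z[y][x] = sum over (ky, kx) of X[y+ky][x+kx] * filt[ky][kx] (wrapped)."""
    h, w = len(x), len(x[0])
    z = [[0] * w for _ in range(h)]
    held = [row[:] for row in x]               # data currently held by each unit
    order = snake_offsets(len(filt))
    for step, (ky, kx) in enumerate(order):
        for yy in range(h):                    # multiply-accumulate (as in S16, S21)
            for xx in range(w):
                z[yy][xx] += held[yy][xx] * filt[ky][kx]
        if step + 1 < len(order):              # one-step shift (as in S22)
            ny, nx = order[step + 1]
            held = shift_from_neighbor(held, ny - ky, nx - kx)
    return z


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0, 0], [0, 0, 0], [0, 0, 2]]
result = convolve_by_shifting(grid, kernel)
```

Under these assumptions every unit sees each of the K^2 weights exactly once, and each move between weights is a single transfer between adjacent units, which is the property the shift in S22 is meant to provide.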

The controller 17 then adds 1 to k (S23). If, as a result, k is not greater than or equal to K^2 (NO in S24), the controller 17 performs the processing from S06 onward again.

If k is greater than or equal to K^2 (YES in S24), the controller 17 adds m to i (S25). If, as a result, i is not greater than or equal to Nif (NO in S26), the controller 17 performs the processing from S06 onward again. That is, when the controller 17 determines that not all of the three-dimensional input data X stored in the buffer 21 has been processed, it performs the processing from S06 onward again. A specific example of the processing from S16 to S22 is described below.
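The way S01 and S25 distribute input channels over the m TPEs can be sketched as below. The function name `channels_for_tpe` is hypothetical, and the sketch assumes 1-based channel numbers and a loop that continues until every channel up to Nif has been handled; the exact termination test is the one in S26.

```python
def channels_for_tpe(n, m, nif):
    """Channels handled by TPE n (1-based) out of m TPEs: n, n + m, n + 2m, ...

    Assumption: channels are numbered 1..nif and all of them are processed.
    """
    return list(range(n, nif + 1, m))


# Example: Nif = 10 input channels shared among m = 4 TPEs.
assignment = {n: channels_for_tpe(n, 4, 10) for n in range(1, 5)}
```

With this interleaving, each channel is processed by exactly one TPE, and the per-TPE results can then be summed to obtain the total over all input channels.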

[Specific example of the processing from S16 to S22]
FIGS. 9 to 15 are diagrams illustrating a specific example of the processing from S16 to S22. In the following description, as shown in FIG. 9, TPEn is assumed to be equipped with 4×4 calculators U (calculators U_{1,1}, U_{1,2}, U_{1,3}, U_{1,4}, U_{2,1}, U_{2,2}, U_{2,3}, U_{2,4}, U_{3,1}, U_{3,2}, U_{3,3}, U_{3,4}, U_{4,1}, U_{4,2}, U_{4,3}, and U_{4,4}). The number of feature weights (K^2) contained in each filter is assumed to be 9.

(Processing when k is 1)
First, the processing when k is 1 will be described.

In this case, as shown in FIG. 10(A), the controller 17 transmits (broadcasts) the decomposed two-dimensional feature weight ω_1 to each of the calculators U_{1,1} to U_{4,4}. Also as shown in FIG. 10(A), the controller 17 transmits the data X_{1,1} to X_{4,4} contained in the two-dimensional input data X stored in the register R3 to the calculators U_{1,1} to U_{4,4}, respectively.

Thereafter, each of the calculators U_{1,1} to U_{4,4} multiplies the feature weight ω_1 by the data X. Specifically, for example, the calculator U_{2,2} calculates the product of the feature weight ω_1 and the data X_{2,2}, as shown in FIG. 10(A).

Next, each of the calculators U_{1,1} to U_{4,4} adds the calculated product to the data corresponding to that calculator U in the three-dimensional data Z stored in the register R3. Specifically, for example, as shown in Step 1 of FIG. 10(B), the calculator U_{2,2} adds the product of the feature weight ω_1 and the data X_{2,2} to the value corresponding to the calculator U_{2,2} (0 when k is 1) in the three-dimensional data Z stored in the register R3.

Then, each of the calculators U_{1,1} to U_{4,4} transmits the data X it holds to an adjacent calculator U. Specifically, for example, the calculator U_{2,2} transmits the data X_{2,2} to the calculator U_{2,1}, as shown in FIG. 11(A).

Here, when no other calculator U exists in the transmission direction of the data X, each calculator U may transmit the data X it holds to the calculator U on the opposite side by using the torus network shown in FIG. 10(A) and other figures. Specifically, in the example shown in FIG. 10(A), the calculators U_{1,1}, U_{2,1}, U_{3,1}, and U_{4,1} (the calculators U located at the left end) transmit the data they hold to the calculators U_{1,4}, U_{2,4}, U_{3,4}, and U_{4,4} (the calculators U located at the right end), respectively.

This prevents the data X held by each calculator U from being lost by the shift, even when no other calculator U exists in the transmission direction of the data X.

In this case, the data X that the calculators U_{1,1}, U_{2,1}, U_{3,1}, and U_{4,1} transmit to the calculators U on the opposite side via the torus network must be prevented from being used for the multiply-accumulate calculation in those calculators U. For this reason, each of the calculators U_{1,1}, U_{2,1}, U_{3,1}, and U_{4,1} adds, for example, a flag indicating that the data was transmitted from the calculator U on the opposite side via the torus network (a flag indicating that the data is not to be used for the multiply-accumulate calculation) before transmitting the data X to the calculators U_{1,4}, U_{2,4}, U_{3,4}, and U_{4,4}.
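A minimal Python sketch of one leftward shift with wrap-around and flagging is given below. The `(value, wrapped)` tuple layout and both function names are hypothetical. The sketch covers a single shift step only; how the flag is maintained over later shifts is not modeled here.

```python
def shift_left_with_flag(grid):
    """Shift every row left by one position.

    grid[y][x] is a (value, wrapped) pair. The item leaving the left edge
    re-enters at the right edge with wrapped=True.
    """
    shifted = []
    for row in grid:
        value, _ = row[0]
        shifted.append(row[1:] + [(value, True)])
    return shifted


def mac_skipping_flagged(Z, grid, w):
    """Add w * value to Z only where the held item is not flagged."""
    return [[z if wrapped else z + w * value
             for z, (value, wrapped) in zip(z_row, g_row)]
            for z_row, g_row in zip(Z, grid)]


grid = [[(1, False), (2, False), (3, False)]]
grid = shift_left_with_flag(grid)
Z = mac_skipping_flagged([[0, 0, 0]], grid, w=2)
print(grid, Z)
```

In this toy row, the value that wrapped around the edge stays in the grid, so it is not lost, but it contributes nothing to Z.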

In this case, TPEn may further include padding calculators U arranged outside the 4×4 calculators U such as the calculator U_{1,1}. In the example shown in FIG. 10(A), the calculators U_{1,1}, U_{2,1}, U_{3,1}, and U_{4,1} may then transmit the data they hold to the padding calculators U located to their left.

In this case, the number of calculators U (including the padding calculators U) in TPEn is calculated by the following formula (1).

(Processing when k is 2)
Next, the processing when k is 2 will be described.

In this case, as shown in FIG. 11(A), the controller 17 transmits (broadcasts) the decomposed two-dimensional feature weight ω_2 to each of the calculators U_{1,1} to U_{4,4}. Then, each of the calculators U_{1,1} to U_{4,4}, if it holds data X (that is, if it received data X from an adjacent calculator U in the immediately preceding processing of S12), multiplies the feature weight ω_2 by the data X. Specifically, for example, the calculator U_{2,2} calculates the product of the feature weight ω_2 and the data X_{2,3} (the data X transmitted to the calculator U_{2,2}), as shown in FIG. 11(A).

Next, each of the calculators U_{1,1} to U_{4,4} adds the calculated product to the value corresponding to that calculator U in the three-dimensional data Z stored in the register R3. Specifically, for example, as shown in Step 2 of FIG. 11(B), the calculator U_{2,2} adds the product of the feature weight ω_2 and the data X_{2,3} to the value corresponding to the calculator U_{2,2} in the three-dimensional data Z stored in the register R3.

Then, each of the calculators U_{1,1} to U_{4,4} transmits the data X it holds to an adjacent calculator U. Specifically, for example, the calculator U_{2,2} transmits the data X_{2,3} to the calculator U_{3,2}, as shown in FIG. 11(B).

(Processing when k is 3)
Next, the processing when k is 3 will be described.

In this case, as shown in FIG. 12(A), the controller 17 transmits (broadcasts) the decomposed two-dimensional feature weight ω_3 to each of the calculators U_{1,1} to U_{4,4}. Then, each of the calculators U_{1,1} to U_{4,4}, if it holds data X, multiplies the feature weight ω_3 by the data X. Specifically, for example, the calculator U_{2,2} calculates the product of the feature weight ω_3 and the data X_{1,3}, as shown in FIG. 12(A).

Next, each of the calculators U_{1,1} to U_{4,4} adds the calculated product to the value corresponding to that calculator U in the three-dimensional data Z stored in the register R3. Specifically, for example, as shown in Step 3 of FIG. 12(B), the calculator U_{2,2} adds the product of the feature weight ω_3 and the data X_{1,3} to the value corresponding to the calculator U_{2,2} in the three-dimensional data Z stored in the register R3.

Then, each of the calculators U_{1,1} to U_{4,4} transmits the data X it holds to an adjacent calculator U. Specifically, for example, the calculator U_{2,2} transmits the data X_{1,3} to the calculator U_{2,3}.

Thereafter, as shown in FIGS. 13 and 14, the calculator U_{2,2}, for example, performs the following multiplications in addition to ω_1 × X_{2,2} (k = 1), ω_2 × X_{2,3} (k = 2), and ω_3 × X_{1,3} (k = 3): ω_4 × X_{1,2} (k = 4), ω_5 × X_{1,1} (k = 5), ω_6 × X_{2,1} (k = 6), ω_7 × X_{3,1} (k = 7), ω_8 × X_{3,2} (k = 8), and ω_9 × X_{3,3} (k = 9).
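The order in which the calculator U_{2,2} sees the input data can be reproduced with a small Python simulation of the 4×4 torus. Only the first three shift directions (left, down, right) are stated in the text. The remaining entries of `SHIFTS` are inferred from the data listed for k = 4 to 9 and should be read as an assumption, as should the function names.

```python
# (row step, column step) applied to all data between consecutive values of k.
SHIFTS = [(0, -1), (1, 0), (0, 1), (0, 1), (-1, 0), (-1, 0), (0, -1), (0, -1)]


def shift_torus(grid, dy, dx):
    """Move every element by (dy, dx) positions with wrap-around."""
    rows, cols = len(grid), len(grid[0])
    out = [[None] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            out[(y + dy) % rows][(x + dx) % cols] = grid[y][x]
    return out


def trace(pe_row, pe_col, size=4):
    """Return the data labels held by calculator U_{pe_row,pe_col} for k = 1..9."""
    grid = [[f"X{r},{c}" for c in range(1, size + 1)] for r in range(1, size + 1)]
    seen = [grid[pe_row - 1][pe_col - 1]]
    for dy, dx in SHIFTS:
        grid = shift_torus(grid, dy, dx)
        seen.append(grid[pe_row - 1][pe_col - 1])
    return seen


seen_by_u22 = trace(2, 2)
print(seen_by_u22)
```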

In other words, each time k is incremented, the feature weight is broadcast and the data X is shifted, and each calculator U performs the convolution of the feature weights contained in each filter with the input data X. The convolution is thereby carried out in parallel across the calculators U.
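The following Python sketch puts the broadcast, multiply-accumulate, and shift steps together for a single filter and checks the result against a direct computation. The `OFFSETS` table is read off from the U_{2,2} example and is an assumption, as are the function names. Instead of a flag, data that wrapped around the torus is excluded by a bounds check, which is a simplification chosen for this sketch.

```python
# Offset (row, col) of the input element seen by each calculator at k = 1..9.
OFFSETS = [(0, 0), (0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]


def shift_torus(grid, dy, dx):
    """Move every element by (dy, dx) positions with wrap-around."""
    rows, cols = len(grid), len(grid[0])
    out = [[None] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            out[(y + dy) % rows][(x + dx) % cols] = grid[y][x]
    return out


def conv_by_shifting(X, weights):
    """Accumulate Z by broadcasting one weight per step and shifting X."""
    rows, cols = len(X), len(X[0])
    Z = [[0] * cols for _ in range(rows)]
    held = [row[:] for row in X]
    prev = (0, 0)
    for w, (dr, dc) in zip(weights, OFFSETS):
        # Shift so that calculator (y, x) holds X[y + dr][x + dc].
        held = shift_torus(held, prev[0] - dr, prev[1] - dc)
        prev = (dr, dc)
        for y in range(rows):
            for x in range(cols):
                # Skip data that arrived via wrap-around.
                if 0 <= y + dr < rows and 0 <= x + dc < cols:
                    Z[y][x] += w * held[y][x]
    return Z


def conv_direct(X, weights):
    """Reference result computed without any shifting."""
    rows, cols = len(X), len(X[0])
    Z = [[0] * cols for _ in range(rows)]
    for w, (dr, dc) in zip(weights, OFFSETS):
        for y in range(rows):
            for x in range(cols):
                if 0 <= y + dr < rows and 0 <= x + dc < cols:
                    Z[y][x] += w * X[y + dr][x + dc]
    return Z


X = [[4 * r + c + 1 for c in range(4)] for r in range(4)]
Z_ones = conv_by_shifting(X, [1] * 9)
print(Z_ones)
```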

When multiplying a feature weight by the data X, each calculator U performs, as shown in FIG. 15, the multiplication of the feature weight contained in each filter by the input data X in parallel for each of the Nof types of filters.

Specifically, when k is 1, the calculator U_{2,2} calculates the product Z_{2,2,1} by multiplying the first feature weight ω_1^(1) contained in the first type of filter by the data X_{2,2}. In this case, the calculator U_{2,2} also calculates the product Z_{2,2,2} by multiplying the first feature weight ω_2^(1) contained in the second type of filter by the data X_{2,2}. Similarly, the calculator U_{2,2} performs the multiplication corresponding to each of the Nof types of filters.

This allows the calculators U to perform parallel processing with a multiplicity corresponding to the product of the number of calculators U (Nox × Noy) and the number of filter types (Nof).
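A short Python sketch of one such step across several filters is shown below. The index order Z[y][x][f] follows the Z_{2,2,1} notation above, while the function name `broadcast_multiply` and the sample values are hypothetical.

```python
def broadcast_multiply(X, weights_k):
    """Return Z[y][x][f] = weights_k[f] * X[y][x].

    weights_k[f] is the feature weight of filter f broadcast at the current step.
    """
    return [[[w * x for w in weights_k] for x in row] for row in X]


X = [[1, 2], [3, 4]]      # Nox = Noy = 2
weights_k = [10, 100]     # Nof = 2
Z = broadcast_multiply(X, weights_k)
num_products = sum(len(cell) for row in Z for cell in row)
print(Z[1][1], num_products)
```

Here the number of independent products in one step is Nox × Noy × Nof = 2 × 2 × 2.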

Furthermore, in the data processing device 10 of this embodiment, the processing performed by each calculator U as described above is performed in each TPE.

As a result, the data processing device 10 can perform parallel processing with a multiplicity corresponding to the product of the number of calculators U in each TPE (Nox × Noy), the number of filter types (Nof), and the number of TPEs (n). The data processing device 10 can therefore perform the convolution operations of a CNN at high speed.
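As a worked illustration with hypothetical sizes (Nox = Noy = 4 as in FIG. 9, together with the assumed values Nof = 8 and n = 4, which are not taken from the embodiment), the multiplicity would be:

```latex
N_{ox} \times N_{oy} \times N_{of} \times n = 4 \times 4 \times 8 \times 4 = 512
```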

Returning to FIG. 7, if i is equal to or greater than Nif (YES in S26), the controller 17 performs the processing from S27 onward. That is, upon completion of the convolution operation in each TPE, the controller 17 starts the processing of calculating the sum of the values calculated in the TPEs.

For example, if TPEn is not TPE1 (NO in S27), the controller 17 receives three-dimensional data Z' from the adjacent TPE(n-1), as shown in FIG. 8 (S31).

Next, the Nox × Noy calculators U add the three-dimensional data Z stored in the register R1 to the three-dimensional data Z' received in the processing of S31 (S32).

Thereafter, the Nox × Noy calculators U store the three-dimensional data Z calculated in the processing of S32 in the register R1 (S33).

Then, if TPEn is not TPEm (YES in S34), TPEn transmits the three-dimensional data Z stored in the register R1 to the adjacent TPE(n+1) (S35). TPEn also stores the three-dimensional data Z stored in the register R1 in the buffer 23 (S36).

This enables the data processing device 10 to calculate the sum of the convolution results calculated in the TPEs.

Note that if TPEn is TPE1 (YES in S27), TPEn likewise performs the processing of S35.

On the other hand, if TPEn is TPEm (NO in S34), TPEn does not perform the processing of S35 and stores the three-dimensional data Z stored in the register R1 in the buffer 23 (S36). That is, in this case, the controller 17 stores the final result of the convolution operations performed in the TPEs in the buffer 23.
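The chained summation of S27 to S36 can be sketched in Python as follows. For brevity each TPE's partial result is a two-dimensional list rather than three-dimensional data, and the function name `chain_sum` is hypothetical.

```python
def chain_sum(partials):
    """Pass a running sum from TPE1 to TPEm and return the final value.

    partials[n] is the partial result Z held by the (n + 1)-th TPE.
    """
    received = None
    for Z in partials:
        if received is not None:                      # S27 NO: not TPE1
            Z = [[a + b for a, b in zip(row_a, row_b)]
                 for row_a, row_b in zip(received, Z)]  # S31 to S33
        received = Z                                  # S35: hand over to the next TPE
    return received                                   # S36 at TPEm: final result


partials = [[[1, 2]], [[10, 20]], [[100, 200]]]
total = chain_sum(partials)
print(total)
```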

In this way, the data processing device 10 of this embodiment can perform the multiply-accumulate operations over Nof, Nox, Noy, and Nif in parallel. As shown in FIG. 16, the data processing device 10 of this embodiment can therefore achieve a higher degree of parallelism for the multiply-accumulate operations of the convolution than when other methods are adopted.

Accordingly, the data processing device 10 of this embodiment can significantly reduce the number of times the multiply-accumulate operation is executed, and can therefore perform the convolution operations in the convolutional layer of a CNN at high speed.

In addition, when, for example, the execution time of a convolution operation is required to be at or below a threshold, the data processing device 10 of this embodiment reduces the number of times the multiply-accumulate operation is executed, so each multiply-accumulate operation may be set to execute more slowly. This means that the operating frequency of the data processing device 10 can be lowered and the drive voltage of the circuit can be reduced. The data processing device 10 of this embodiment can therefore reduce the power consumed in executing convolution operations.
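The link between frequency, voltage, and power can be illustrated with the commonly used first-order model of dynamic power in CMOS circuits. This model is not stated in the embodiment and is given here only as background; α (switching activity), C (switched capacitance), V_DD (drive voltage), and f (operating frequency) are symbols introduced for this illustration.

```latex
P_{\mathrm{dyn}} \approx \alpha \, C \, V_{DD}^{2} \, f
```

Under this model, lowering f reduces power linearly, and any accompanying reduction in V_DD reduces it quadratically.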

As a result, the data processing device 10 of this embodiment can also be used in fields that impose various constraints (for example, robot motion control and autonomous driving of automobiles).

1: TPE
2: TPE
n: TPE
m: TPE
10: Data processing device
12: Board
13: CPU
14: Chip
15: Memory
16: Memory
17: Controller
20: STP
21: Buffer
22: Buffer
23: Buffer
R1: Register
R2: Register
R3: Register
U: Calculator (arithmetic unit)

Claims

1. A data processing device that calculates a feature amount of two-dimensional input data using a plurality of filters corresponding to the two-dimensional input data, the data processing device comprising:
a processing element group consisting of a plurality of processing elements arranged two-dimensionally; and
a first storage unit having, for each of the plurality of processing elements, a three-dimensional storage area for each of the plurality of filters,
wherein the processing element group:
sends each piece of data included in the two-dimensional input data to one of the plurality of processing elements such that each destination is a different processing element;
obtains one feature weight included in each of the plurality of filters;
generates three-dimensional feature weights by duplicating the one feature weight for each of the plurality of filters by the same number as the plurality of processing elements and arranging the duplicated feature weights in the same arrangement as the plurality of processing elements;
transmits the one feature weight for each of the plurality of filters to each of the plurality of processing elements;
causes each of the plurality of processing elements to multiply the transmitted data included in the two-dimensional input data by the transmitted one feature weight for each of the plurality of filters to calculate a three-dimensional product;
adds the three-dimensional product to the value stored in the three-dimensional storage area of the first storage unit;
causes each of the plurality of processing elements to transmit the data included in the two-dimensional input data that it holds to an adjacent processing element such that data that has not yet been used in the calculation of the three-dimensional product is transmitted to each of the plurality of processing elements; and
repeats the obtaining, the generating, the transmitting of the one feature weight for each of the plurality of filters, the calculating, the adding, and the transmitting of the data included in the two-dimensional input data to an adjacent processing element until the number of times the adding has been performed reaches the number of feature weights included in each of the plurality of filters.

2. The data processing device according to claim 1, wherein the processing elements perform processing in a convolutional layer of a convolutional neural network.

3. The data processing device according to claim 1, wherein the processing element group outputs the three-dimensional values stored in the first storage unit after the repeated processing.

4. The data processing device according to claim 1, further comprising a second storage unit having, for each of the plurality of processing elements, a three-dimensional storage area for each of the plurality of filters,
wherein the processing element group stores the generated three-dimensional feature weights in the second storage unit.

5. The data processing device according to claim 1, further comprising a third storage unit having a two-dimensional storage area for each of the plurality of processing elements,
wherein the processing element group transmits each piece of data included in the two-dimensional input data stored in the third storage unit to one of the plurality of processing elements such that each destination is a different processing element.

6. The data processing device according to claim 1, comprising:
a plurality of the processing element groups, each performing processing based on mutually different two-dimensional input data and mutually different pluralities of filters; and
a fourth storage unit having, for each of the plurality of processing elements, a three-dimensional storage area for each of the plurality of filters,
wherein the plurality of processing element groups:
calculate, after the repeated processing, a three-dimensional sum by adding up the three-dimensional values stored in the first storage unit of each processing element group; and
store the calculated three-dimensional sum in the fourth storage unit.

7. The data processing device according to claim 6, wherein the plurality of processing elements output the three-dimensional sum stored in the fourth storage unit after the processing of calculating the three-dimensional sum.

8. A data processing method for a data processing device that calculates a feature amount of two-dimensional input data using a plurality of filters corresponding to the two-dimensional input data, the data processing device comprising:
a processing element group consisting of a plurality of processing elements arranged two-dimensionally; and
a first storage unit having, for each of the plurality of processing elements, a three-dimensional storage area for each of the plurality of filters,
the data processing method comprising, by the processing element group:
sending each piece of data included in the two-dimensional input data to one of the plurality of processing elements such that each destination is a different processing element;
obtaining one feature weight included in each of the plurality of filters;
generating three-dimensional feature weights by duplicating the one feature weight for each of the plurality of filters by the same number as the plurality of processing elements and arranging the duplicated feature weights in the same arrangement as the plurality of processing elements;
transmitting the one feature weight for each of the plurality of filters to each of the plurality of processing elements;
causing each of the plurality of processing elements to multiply the transmitted data included in the two-dimensional input data by the transmitted one feature weight for each of the plurality of filters to calculate a three-dimensional product;
adding the three-dimensional product to the value stored in the three-dimensional storage area of the first storage unit;
causing each of the plurality of processing elements to transmit the data included in the two-dimensional input data that it holds to an adjacent processing element such that data that has not yet been used in the calculation of the three-dimensional product is transmitted to each of the plurality of processing elements; and
repeating the obtaining, the generating, the transmitting of the one feature weight for each of the plurality of filters, the calculating, the adding, and the transmitting of the data included in the two-dimensional input data to an adjacent processing element until the number of times the adding has been performed reaches the number of feature weights included in each of the plurality of filters.