JP2021005242A

JP2021005242A - Information processing device, information processing program, and information processing method

Info

Publication number: JP2021005242A
Application number: JP2019119018A
Authority: JP
Inventors: Toshihiro Shimizu (清水 俊宏)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2019-06-26
Filing date: 2019-06-26
Publication date: 2021-01-14
Anticipated expiration: 2039-06-26
Also published as: EP3757902A1; US20200410340A1; CN112149794A; US11631002B2; JP7251354B2

Abstract

To accelerate convolution calculation. SOLUTION: An information processing device comprises: a calculation section 42 that calculates, among the combinations of t and q for which the total number of elements of a plurality of first matrices g and a plurality of t×t second matrices d does not exceed the number of data items that can be stored in each of q storage areas among the storage areas R#0 to R#7 of a register G#0, the combination that minimizes the calculation time when q calculation cores C#0 to C#3 each execute, in parallel, the convolution calculation of the first matrices g and the second matrices d by the Winograd algorithm; and an output section that outputs a program 50 for causing a computer 10 to execute processing for storing the first matrices g and the t×t second matrices d in each of the q storage areas R#0 to R#3, and processing in which each of the q calculation cores C#0 to C#3 performs the convolution calculation of the first matrices g and the second matrices d using the Winograd algorithm. SELECTED DRAWING: Figure 22

Description

The present invention relates to an information processing device, an information processing program, and an information processing method.

Machine learning that uses a multi-layered neural network is called deep learning and is applied in various fields. Various calculations are performed in each layer of a deep learning network. For example, a convolution layer performs a convolution calculation between image data and a filter and outputs the result to the subsequent stage. Because the convolution calculation is a calculation between matrices, its computational cost is large, which is one factor that slows down training. The Winograd algorithm has therefore been proposed as an algorithm for reducing the computational cost of the convolution calculation.

“Fast Algorithms for Convolutional Neural Networks”, Andrew Lavin et al., The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4013-4021
“Deep Residual Learning for Image Recognition”, Kaiming He et al., The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778

However, the Winograd algorithm leaves room for improvement in terms of further speeding up the convolution calculation.

According to one aspect, an object of the present invention is to speed up the convolution calculation.

According to one aspect, there is provided an information processing device including: a calculation unit that calculates, among the combinations of t and q for which the total number of elements of a plurality of first matrices and a plurality of second matrices of t rows and t columns does not exceed the number of data items that can be stored in each of q storage areas among a plurality of storage areas of a register, the combination that minimizes the calculation time when each of q calculation cores, corresponding respectively to the q storage areas, executes in parallel the convolution calculation of the plurality of first matrices and the plurality of second matrices by the Winograd algorithm; and an output unit that outputs a program for causing a computer having the calculation cores and the register to execute, using the calculated combination of t and q, a process of storing the plurality of first matrices and the plurality of second matrices of t rows and t columns in each of the q storage areas, and a process in which each of the q calculation cores performs the convolution calculation of the first matrices and the second matrices using the Winograd algorithm.
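The selection performed by the calculation unit can be pictured as a small constrained search. The following is a minimal sketch only: the function `best_t_q`, its parameters, and both toy models are hypothetical names and formulas invented for illustration, and the actual element count and cycle estimate are not given in this part of the description.

```python
from typing import Callable, Iterable, Optional, Tuple


def best_t_q(
    t_candidates: Iterable[int],
    q_candidates: Iterable[int],
    elements_per_bank: Callable[[int, int], int],
    cycles: Callable[[int, int], float],
    bank_capacity: int,
) -> Optional[Tuple[int, int]]:
    """Return the feasible (t, q) with the smallest estimated time, or None."""
    best = None
    best_cost = float("inf")
    q_list = list(q_candidates)
    for t in t_candidates:
        for q in q_list:
            # Constraint: the matrices placed in one storage area must fit in it.
            if elements_per_bank(t, q) > bank_capacity:
                continue
            cost = cycles(t, q)
            if cost < best_cost:
                best, best_cost = (t, q), cost
    return best


# Toy models, made up for this example (NOT taken from the patent).
def toy_elements(t: int, q: int) -> int:
    # Pretend 8 (filter, tile) pairs are split evenly over q storage areas.
    return (9 + t * t) * (8 // q)


def toy_cycles(t: int, q: int) -> float:
    # Pretend time shrinks with more cores and with larger tiles.
    return 1000.0 / (q * (t - 2) ** 2)


print(best_t_q([4, 6, 8], [1, 2, 4, 8], toy_elements, toy_cycles, 128))
```

An exhaustive loop is used here because the candidate sets for t and q are small; any other search strategy would serve equally well as long as the capacity constraint is checked first.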

According to one aspect, the convolution calculation can be sped up.

FIG. 1 is a diagram schematically showing the flow of deep learning processing.
FIG. 2 is a diagram schematically showing the convolution calculation performed in a convolution layer.
FIGS. 3(a) to 3(c) are diagrams schematically showing the convolution calculation of a bottom matrix and a weight matrix.
FIGS. 4(a) to 4(c) are diagrams schematically showing the Winograd algorithm in forward processing.
FIG. 5 is a hardware configuration diagram of a computer for performing convolution calculation in deep learning and the like.
FIG. 6(a) is a hardware configuration diagram of one DPU-chain, and FIG. 6(b) is a hardware configuration diagram of one DPU.
FIG. 7 is a hardware configuration diagram of each DPE.
FIG. 8 is a hardware configuration diagram of DPE0.
FIG. 9 is a diagram for explaining the line numbers assigned to each of the banks R#0 to R#7.
FIGS. 10(a) to 10(c) are schematic diagrams (part 1) for explaining the sequential method.
FIGS. 11(a) to 11(c) are schematic diagrams (part 2) for explaining the sequential method.
FIG. 12 is a schematic diagram for explaining the multicast method.
FIG. 13 is a diagram schematically showing the contents of the register G#0 of each DPE.
FIG. 14 is a diagram schematically showing the array elements of the array g in the main memory.
FIG. 15 is a diagram showing the contents of the register G#0 of DPE0 immediately after transfer by the multicast method.
FIG. 16 is a diagram showing the contents of the register G#0 of DPE0 after alignment.
FIG. 17 is a diagram showing the contents of the register G#0 of each of DPE0 to DPE7 after alignment.
FIG. 18 is a schematic diagram of the bank R#0 of the register G#0 of DPE0.
FIG. 19 is a hardware configuration diagram of the information processing device according to the present embodiment.
FIG. 20 is a functional configuration diagram of the information processing device according to the present embodiment.
FIG. 21 is a functional block diagram of the computer.
FIG. 22 is a diagram showing the contents of the register G#0 of each of DPE0 to DPE7 in which the arrays d and g have been stored by the storage unit when forward processing is performed in the present embodiment.
FIGS. 23(a) and 23(b) are diagrams (part 1) showing the contents of the registers G#0 to G#3 of DPE0 when the calculation unit performs the convolution calculation with the Winograd algorithm in the present embodiment.
FIG. 24 is a diagram (part 2) showing the contents of the registers G#0 to G#3 of DPE0 when the calculation unit performs the convolution calculation with the Winograd algorithm in the present embodiment.
FIG. 25 is a diagram (part 3) showing the contents of the registers G#0 to G#3 of DPE0 when the calculation unit performs the convolution calculation with the Winograd algorithm in the present embodiment.
FIG. 26 is a schematic diagram showing the calculation of equation (19) of the present embodiment step by step.
FIG. 27 is a schematic diagram showing the calculation of equation (21) of the present embodiment step by step.
FIG. 28 is a flowchart of the information processing method according to the present embodiment.
FIGS. 29(a) to 29(c) are schematic diagrams of the convolution calculation of a top matrix and a weight matrix performed with the Winograd algorithm in the backward processing according to the present embodiment.
FIG. 30 is a diagram showing the contents of the register G#0 of each of DPE0 to DPE7 in which the arrays y and g have been stored by the storage unit according to the present embodiment.
FIGS. 31(a) and 31(b) are schematic diagrams of the convolution calculation of a top matrix and a bottom matrix performed with the Winograd algorithm in the backward processing according to the present embodiment.
FIGS. 32(a) to 32(c) are schematic diagrams of the convolution calculation of a top matrix and a bottom matrix performed with the Winograd algorithm in the backward processing according to the present embodiment.
FIG. 33 is a diagram showing the contents of the register G#0 of each of DPE0 to DPE7 in which the arrays y and d have been stored by the storage unit according to the present embodiment.
FIG. 34 is a diagram showing the contents of the register G#0 of DPE0 in which the arrays d and g have been stored by the storage unit when a 1×1 convolution is performed in the present embodiment.
FIG. 35 is a diagram showing the contents of the register G#0 of DPE0 in which the small bottom matrix d has been stored by the storage unit according to the present embodiment during batch normalization.
FIGS. 36(a) and 36(b) are diagrams showing the contents of the register G#0 of DPE0 for explaining the calculation performed by the calculation unit according to the present embodiment during batch normalization.

Prior to the description of the present embodiment, matters examined by the inventor of the present application will be described.

FIG. 1 is a diagram schematically showing the flow of deep learning processing.
In deep learning, a neural network learns the features of an identification target, such as an image, through supervised learning on that target. The identification target can then be identified by using the trained neural network.

A neural network is a network in which units modeled on the neurons of the brain are connected hierarchically. Each unit receives data from other units and passes data to other units. In a neural network, various identification targets can be identified by changing the parameters of the units through learning.

In the following, a convolutional neural network (CNN) used for image recognition will be described with reference to FIG. 1.

This neural network has a hierarchical structure including convolution layers, sub-sampling layers, and fully-connected layers. In the example of FIG. 1, the convolution layer and the sub-sampling layer are provided alternately twice, but more of these layers may be provided. A plurality of fully-connected layers may also be provided. The hierarchical structure of the neural network and the configuration of each layer may be determined in advance by the designer according to the target to be identified.

The process of identifying an image with a neural network is also called forward processing. In forward processing, as shown in FIG. 1, convolution layers and pooling (sub-sampling) layers are repeated alternately from left to right. Finally, the fully-connected layer identifies the identification target appearing in the image.

The process of training a neural network on images is also called backward processing. In backward processing, the error between the identification result and the correct answer is obtained and back-propagated through the neural network from right to left, and the parameters of each layer of the convolutional neural network are changed.

FIG. 2 is a diagram schematically showing the convolution calculation performed in a convolution layer.

FIG. 2 illustrates the convolution calculation between bottom matrices, whose elements hold the pixel data of an input image, and weight matrices, which represent the filters applied to the input image. In this example, a plurality of bottom matrices and a plurality of weight matrices are prepared, and convolution is performed between them.

Each of the bottom matrices is identified by a batch number N and an input channel number Cin. Each weight matrix, on the other hand, is identified by an output channel number Cout and an input channel number Cin.

In the example of FIG. 2, the convolution calculation is performed as follows.
First, one combination of a batch number N and an output channel number Cout is selected, for example N = 0 and Cout = 0.

Then, among the combinations of the bottom matrices having the selected batch number N and the weight matrices having the selected output channel number Cout, a combination in which the input channel number Cin is the same is selected. For example, when N = 0 and Cout = 0 as above, the bottom matrix with N = 0 and Cin = 0 and the weight matrix with Cout = 0 and Cin = 0 are selected.

A convolution is then performed between the selected bottom matrix and weight matrix. The matrix obtained by this convolution is hereinafter referred to as a top matrix.

By performing such a convolution for each pair of bottom matrix and weight matrix with Cin = 0 to 255 while the batch number N and the output channel number Cout are held fixed, 256 top matrices are obtained. These 256 top matrices are then added together to obtain one output matrix specified by the batch number N and the output channel number Cout.

Furthermore, by repeating this calculation while changing the batch number N and the output channel number Cout, (total number of batch numbers N) × (total number of output channel numbers Cout) output matrices are finally obtained. In the example of FIG. 2, 64 × 384 output matrices are obtained.

In this way, the convolution calculation of the plurality of bottom matrices and the plurality of weight matrices is performed.
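The loop structure described above (fix N and Cout, convolve for every Cin, then add the top matrices) can be sketched as follows. This is a minimal illustration using plain Python lists and tiny sizes rather than the 64 × 256 × 384 sizes of FIG. 2; the function names `conv2d` and `conv_layer` and the nested-list data layout are assumptions made for the example, and padding is omitted for brevity.

```python
def conv2d(bottom, weight):
    """Sliding-window sum of products (no padding) of one bottom/weight pair."""
    kh, kw = len(weight), len(weight[0])
    oh, ow = len(bottom) - kh + 1, len(bottom[0]) - kw + 1
    return [[sum(bottom[i + k][j + l] * weight[k][l]
                 for k in range(kh) for l in range(kw))
             for j in range(ow)] for i in range(oh)]


def conv_layer(bottoms, weights):
    """bottoms[n][cin] and weights[cout][cin] are 2-D lists.

    Returns outputs[n][cout], each being the sum over cin of the top matrices.
    """
    outputs = []
    for per_batch in bottoms:            # fix the batch number N
        row = []
        for per_cout in weights:         # fix the output channel number Cout
            # One top matrix per input channel number Cin.
            tops = [conv2d(b, g) for b, g in zip(per_batch, per_cout)]
            # Add the top matrices element by element.
            acc = [[sum(t[i][j] for t in tops)
                    for j in range(len(tops[0][0]))]
                   for i in range(len(tops[0]))]
            row.append(acc)
        outputs.append(row)
    return outputs


ones3 = [[1] * 3 for _ in range(3)]
ones2 = [[1] * 2 for _ in range(2)]
bottoms = [[ones3, ones3] for _ in range(2)]   # N = 2, Cin = 2
weights = [[ones2, ones2] for _ in range(3)]   # Cout = 3, Cin = 2
outputs = conv_layer(bottoms, weights)
print(len(outputs), len(outputs[0]), outputs[0][0])
```

The result contains N × Cout output matrices, mirroring the 64 × 384 count stated for FIG. 2.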

In such a convolution calculation, as described above, the convolution is performed between a bottom matrix and a weight matrix having the same input channel number Cin. The convolution of these two matrices is therefore described in detail below.

FIGS. 3(a) to 3(c) are diagrams schematically showing the convolution calculation of a bottom matrix and a weight matrix.

First, as shown in FIG. 3(a), the bottom matrix and the weight matrix to be convolved are prepared. In this example, the bottom matrix is a 13 × 13 square matrix and the weight matrix is a 3 × 3 square matrix.

Next, as shown in FIG. 3(b), a 15 × 15 matrix M is obtained by zero-padding the periphery of the bottom matrix.

Subsequently, as shown in FIG. 3(c), a submatrix P_ij having the same size as the weight matrix is extracted from the matrix M. In the following, the element in row k and column l of the submatrix P_ij is denoted by (P_ij)_kl (0 ≤ k, l ≤ 2), and the element in row k and column l of the weight matrix is denoted by g_kl (0 ≤ k, l ≤ 2).

The matrix obtained by convolving the matrix M with the weight matrix is called the top matrix, as described above. Each element r_ij of the top matrix can be calculated from the following equation (1).

$$ r_{ij} = \sum_{k=0}^{2} \sum_{l=0}^{2} (P_{ij})_{kl} \, g_{kl} \qquad (1) $$

With this method, however, obtaining one element r_ij of the top matrix requires as many multiplications as there are elements in the weight matrix (3 × 3), so the convolution calculation cannot be sped up.
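The direct method can be written out as a short sketch that also counts multiplications. The helper names `zero_pad` and `direct_conv` are illustrative assumptions, and the code assumes that P_ij is the 3 × 3 submatrix of M whose upper-left corner lies at row i and column j.

```python
def zero_pad(bottom, pad=1):
    """Surround a square matrix with `pad` rows and columns of zeros."""
    width = len(bottom[0]) + 2 * pad
    rows = [[0] * pad + list(r) + [0] * pad for r in bottom]
    border = [[0] * width for _ in range(pad)]
    return [r[:] for r in border] + rows + [r[:] for r in border]


def direct_conv(bottom, g):
    """Return (top matrix, number of multiplications) for a 3x3 filter g."""
    m = zero_pad(bottom, 1)              # e.g. 13x13 -> 15x15
    size = len(bottom)
    top = [[0] * size for _ in range(size)]
    mults = 0
    for i in range(size):
        for j in range(size):
            # Nine multiplications per element r_ij of the top matrix.
            for k in range(3):
                for l in range(3):
                    top[i][j] += m[i + k][j + l] * g[k][l]
                    mults += 1
    return top, mults


bottom = [[1] * 13 for _ in range(13)]
g = [[1] * 3 for _ in range(3)]
top, mults = direct_conv(bottom, g)
print(len(top), mults)
```

For the 13 × 13 example this performs 13 × 13 × 9 multiplications, which is the cost that the Winograd algorithm aims to reduce.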

The Winograd algorithm is known as an algorithm for speeding up the convolution calculation. The Winograd algorithm is therefore described below.

As described above, deep learning involves forward processing and backward processing; the Winograd algorithm in forward processing is described here.

FIGS. 4(a) to 4(c) are diagrams schematically showing the Winograd algorithm in forward processing.

First, as shown in FIG. 4(a), a small bottom matrix d of size t × t is cut out from the bottom matrix, where t is a natural number.
Next, a small top matrix y is obtained according to the following equation (2).

$$ y = A^{T} \left[ (G g G^{T}) \circledcirc (B^{T} d B) \right] A \qquad (2) $$

The small top matrix y is a matrix that forms part of the top matrix.

B, G, and A in equation (2) are constant matrices. The elements and sizes of these constant matrices B, G, and A change according to the sizes of the matrices g and d. For example, when the size of the weight matrix g is 3 × 3 and the size of the small bottom matrix d is 4 × 4, the elements and sizes of the constant matrices B, G, and A are as shown in the following equation (3).

The operator "◎" in equation (2) denotes element-wise multiplication of matrices. For example, for arbitrary matrices U and V of the same shape with elements u_ij and v_ij, the ij element of U◎V is (U◎V)_ij = u_ij v_ij.

Next, as shown in FIG. 4(b), the position at which the small bottom matrix d is cut out from the bottom matrix is shifted by two columns from the case of FIG. 4(a), and the same calculation as above is performed on the small bottom matrix d cut out there. The small top matrix y obtained in this way forms the block adjacent to the small top matrix y obtained in FIG. 4(a) in the top matrix.

By shifting the cut-out position of the small bottom matrix d by two elements at a time in the column and row directions in this way, the top matrix formed of the small top matrices y is obtained, as shown in FIG. 4(c).

This completes the convolution calculation of the bottom matrix and the weight matrix using the Winograd algorithm.

In the Winograd algorithm of equation (2), once the matrices GgG^T and B^TdB have been formed, the convolution can be carried out simply by computing their element-wise product, so the convolution calculation can be performed at high speed.
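The procedure of FIGS. 4(a) to 4(c) can be checked numerically with a short sketch. The constant matrices below are the widely published ones for a 3 × 3 filter and 4 × 4 tiles from the Lavin et al. paper cited above; it is an assumption of this example that they correspond to equation (3). All function names are illustrative, and the input size is assumed to be even so that the tiles shifted by two cover it exactly.

```python
def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


# Assumed constants for g of size 3x3 and d of size 4x4.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]   # B^T
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                                 # A^T


def filter_transform(g):
    """G g G^T, computed once per weight matrix g and reused for every tile."""
    return matmul(matmul(G, g), transpose(G))


def winograd_tile(d, u):
    """Small top matrix y (2x2) from a 4x4 tile d and u = G g G^T."""
    v = matmul(matmul(BT, d), transpose(BT))                        # B^T d B
    prod = [[a * b for a, b in zip(ru, rv)] for ru, rv in zip(u, v)]  # 16 products
    return matmul(matmul(AT, prod), transpose(AT))                  # A^T [...] A


def winograd_conv(m, g):
    """Top matrix for an n x n input m (n even), cutting tiles shifted by two."""
    out = len(m) - 2
    u = filter_transform(g)
    top = [[0.0] * out for _ in range(out)]
    for i in range(0, out, 2):
        for j in range(0, out, 2):
            d = [row[j:j + 4] for row in m[i:i + 4]]
            y = winograd_tile(d, u)
            for a in range(2):
                for b in range(2):
                    top[i + a][j + b] = y[a][b]
    return top


def direct_conv(m, g):
    """Reference: sliding-window sum of products, nine multiplications each."""
    out = len(m) - 2
    return [[sum(m[i + k][j + l] * g[k][l] for k in range(3) for l in range(3))
             for j in range(out)] for i in range(out)]


m = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(6)]
g = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
fast, slow = winograd_conv(m, g), direct_conv(m, g)
print(all(abs(f - s) < 1e-9 for rf, rs in zip(fast, slow) for f, s in zip(rf, rs)))
```

Each 2 × 2 block of the top matrix costs 16 element-wise multiplications here, compared with 4 × 9 = 36 for the direct method, not counting the additions in the transforms.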

The inventor of the present application estimated the calculation time for the case where, as in this example, the size of the weight matrix g is 3 × 3 and the size of the small bottom matrix d is 4 × 4. As a result, the calculation time in the example of FIGS. 3(a) to 3(c), which does not use the Winograd algorithm, was 1152 cycles. Here, the number of cycles is equivalent to the number of writes to the register.

With the Winograd algorithm, on the other hand, the calculation time was 940 cycles, a speedup of about 1.23 times (= 1152/940) over the example of FIGS. 3(a) to 3(c).

Next, a computer that performs the convolution calculation using the Winograd algorithm is described.

FIG. 5 is a hardware configuration diagram of a computer for performing convolution calculation in deep learning and the like.

As shown in FIG. 5, the computer 10 has a main memory 11 and a processor 12 connected via a bus 13.

Of these, the main memory 11 is a device that temporarily stores data, such as a DRAM (Dynamic Random Access Memory), and executes various programs in cooperation with the processor 12.

The processor 12, on the other hand, is hardware including arithmetic units such as an ALU (arithmetic and logic unit). In this example, a DLU (Deep Learning Unit: registered trademark) is used as the processor 12. The DLU is a processor with an architecture suited to deep learning and has eight DPU (Deep learning Processing Unit)-chains 14.

FIG. 6(a) is a hardware configuration diagram of one DPU-chain 14.

As shown in FIG. 6(a), the DPU-chain 14 includes four DPUs 15. Parallel computation is performed in each of these DPUs 15, as described later.

FIG. 6(b) is a hardware configuration diagram of one DPU 15.

As shown in FIG. 6(b), the DPU 15 has 16 DPEs (Deep learning Processing Elements) 0 to 15.
FIG. 7 is a hardware configuration diagram of each DPE.

As shown in FIG. 6(b), the total number of DPEs is 16, but only DPE0 to DPE7 are shown in the following description.

As shown in FIG. 7, each of DPE0 to DPE7 has eight calculation cores C#0 to C#7 and a register file 20 that these calculation cores C#0 to C#7 can read from and write to.

Of these, the calculation cores C#0 to C#7 are each an independent SIMD (Single Instruction Multiple Data) arithmetic unit, and parallel calculations can be executed in the calculation cores C#0 to C#7.

The register file 20, on the other hand, is connected to the main memory 11 via the bus 13 (see FIG. 5) and stores data read from the main memory 11 and the results calculated by the calculation cores C#0 to C#7.

In this example, the register file 20 is divided into four registers G#0 to G#3, each of which can be read and written in parallel with the others. For example, while the register G#0 is reading data from the main memory 11, the results calculated by the calculation cores C#0 to C#7 can be stored in the register G#1 in parallel.

FIG. 8 is a hardware configuration diagram of DPE0.
Since DPE1 to DPE15 have the same hardware configuration, their description is omitted. FIG. 8 shows only the hardware configuration of the register G#0 among the registers G#0 to G#3 of the register file 20. The other registers G#1 to G#3 have the same hardware configuration.

As shown in FIG. 8, the register G#0 includes eight banks R#0 to R#7. Each of the banks R#0 to R#7 is an example of a storage area and is provided in correspondence with one of the calculation cores C#0 to C#7. For example, the bank R#0 is the storage area provided for the calculation core C#0. When the calculation core C#0 performs a calculation, it reads data from the bank R#0 and writes its calculation result to the bank R#0.

FIG. 9 is a diagram for explaining the line numbers assigned to each of the banks R#0 to R#7.

A line number is an identifier for identifying each entry of the banks R#0 to R#7; in this example, 128 line numbers L₀ to L₁₂₇ are used. The data stored in each entry is not particularly limited. In this example, one floating-point value is stored in one entry, so that 128 floating-point values can be stored in the bank R#0. The same applies to the banks R#1 to R#7.
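The hierarchy described so far can be summarized in a few lines of arithmetic. The class below is a sketch only: the name `DluShape` and its field names are made up for this example, while the counts are the ones stated in the description (eight DPU-chains, four DPUs per chain, 16 DPEs per DPU, eight cores and four registers per DPE, eight banks per register, and line numbers L₀ to L₁₂₇).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DluShape:
    dpu_chains: int = 8          # DPU-chains 14 per DLU
    dpus_per_chain: int = 4      # DPUs 15 per DPU-chain
    dpes_per_dpu: int = 16       # DPE0 to DPE15
    cores_per_dpe: int = 8       # calculation cores C#0 to C#7
    registers_per_dpe: int = 4   # registers G#0 to G#3
    banks_per_register: int = 8  # banks R#0 to R#7
    lines_per_bank: int = 128    # line numbers L0 to L127

    def total_cores(self) -> int:
        return (self.dpu_chains * self.dpus_per_chain
                * self.dpes_per_dpu * self.cores_per_dpe)

    def entries_per_register(self) -> int:
        # One register holds one bank per calculation core.
        return self.banks_per_register * self.lines_per_bank

    def entries_per_dpe(self) -> int:
        return self.registers_per_dpe * self.entries_per_register()


shape = DluShape()
print(shape.total_cores(), shape.entries_per_register(), shape.entries_per_dpe())
```

The number of entries per bank is the quantity that limits how many matrix elements can be assigned to a single calculation core.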

When the convolution calculation of deep learning is performed, the elements of the matrices to be convolved are stored in the entries. These matrix elements are held in the main memory 11 as array elements.

Next, methods of expanding the array elements stored in the main memory 11 to DPE0 to DPE7 are described.

There are two expansion methods: the sequential method and the multicast method.
First, the sequential method is described.

FIGS. 10 and 11 are schematic diagrams for explaining the sequential method.

In this example, the array elements a[0], a[1], a[2], ..., a[127] stored in the main memory 11 are expanded to DPE0 to DPE7.

この場合は、まず図１０（ａ）に示すように、DPE0のバンクR#0においてライン番号L₀で特定されるエントリに最初の配列要素a[0]を格納する。 In this case, first, as shown in FIG. 10A, the first array element a [0] is stored in the entry specified by the line number L ₀ in the bank R # 0 of DPE 0.

次に、図１０（ｂ）に示すように、ライン番号L₀を変えずに隣のバンクR#1に次の配列要素a[1]を格納する。 Next, as shown in FIG. 10B, the next array element a [1] is stored in the adjacent bank R # 1 without changing the line number L ₀ .

そして、図１０（ｃ）に示すように、ライン番号L₀を変えずに隣のバンクに次々と要素を格納していく。これにより、DPE0〜DPE7の各バンクR#0〜R#7においてライン番号L₀で特定されるエントリが、配列要素a[0]、a[1]、a[2]、…a[63]で埋められることになる。 Then, as shown in FIG. 10 (c), the elements are stored one after another in the adjacent bank without changing the line number L ₀ . As a result, the entries specified by the line number L ₀ in each bank R # ₀ to R # 7 of DPE0 to DPE7 are array elements a [0], a [1], a [2], ... a [63]. Will be filled with.

この後は、図１１（ａ）に示すように、DPE0のバンクR#0においてライン番号L₁で特定されるエントリに次の配列要素a[64]を格納する。 Thereafter, as shown in FIG. 11(a), the next array element a[64] is stored in the entry specified by the line number L₁ in the bank R#0 of DPE0.

そして、図１１（ｂ）に示すように、ライン番号L₁を変えずに隣のバンクR#1に次の配列要素a[65]を格納する。 Then, as shown in FIG. 11(b), the next array element a[65] is stored in the adjacent bank R#1 without changing the line number L₁.

更に、このようにライン番号L₁を変えずに隣のバンクに次々と配列要素を格納していく。これにより、図１１（ｃ）に示すように、DPE0〜DPE7の各バンクR#0〜R#7においてライン番号L₁で特定されるエントリが配列要素a[64]、a[65]、a[66]、…a[127]で埋められる。 Furthermore, the array elements are stored one after another in the adjacent banks in this way without changing the line number L₁. As a result, as shown in FIG. 11(c), the entries specified by the line number L₁ in the banks R#0 to R#7 of DPE0 to DPE7 are filled with the array elements a[64], a[65], a[66], ... a[127].

以上により、シーケンシャル方式により配列要素a[0]、a[1]、a[2]、…a[127]がDPE0〜DPE7に展開されたことになる。このようなシーケンシャル方式によれば、DPE0〜DPE7の同一のライン番号L_iにあるエントリが順に埋められていき、そのライン番号L_iの最後のエントリが埋まったところで次のライン番号L_{i+1}に配列要素が格納されていく。 As described above, the array elements a[0], a[1], a[2], ... a[127] are expanded to DPE0 to DPE7 by the sequential method. In this sequential method, the entries having the same line number L_i in DPE0 to DPE7 are filled in order, and once the last entry of the line number L_i has been filled, the array elements are stored in the next line number L_{i+1}.
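
The fill order of the sequential method can be written as a small index calculation. The following is a minimal sketch, assuming 8 DPEs with 8 banks each as in the example above; the function name `sequential_slot` is hypothetical and does not appear in the original description.

```python
# Illustrative sketch only: maps the index i of array element a[i] to the
# slot it occupies under the sequential method described above.
# Assumption: 8 DPEs (DPE0 to DPE7), each with 8 banks (R#0 to R#7).
N_DPE = 8
N_BANK = 8


def sequential_slot(i, n_dpe=N_DPE, n_bank=N_BANK):
    """Return (line, dpe, bank) for array element a[i]."""
    per_line = n_dpe * n_bank            # entries that share one line number
    line, offset = divmod(i, per_line)   # a new line starts once a line is full
    dpe, bank = divmod(offset, n_bank)   # banks of DPE0 first, then DPE1, ...
    return line, dpe, bank


if __name__ == "__main__":
    for i in (0, 1, 63, 64, 65, 127):
        print(i, sequential_slot(i))
```

With these assumptions, a[0] to a[63] share line 0 and a[64] to a[127] share line 1, which matches FIGS. 10 and 11 as described.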

次に、マルチキャスト方式について説明する。
図１２は、マルチキャスト方式について説明するための模式図である。 Next, the multicast method will be described.
FIG. 12 is a schematic diagram for explaining the multicast method.

この例では、メインメモリ１１に格納された配列要素a[0]、a[1]、a[2]、…a[23]をDPE0〜DPE7に展開するものとする。 In this example, it is assumed that the array elements a [0], a [1], a [2], ... a [23] stored in the main memory 11 are expanded to DPE0 to DPE7.

マルチキャスト方式では、DPE0にa[0]、a[1]、a[2]、…a[23]を順に格納していく。そして、これと同様にしてDPE1〜DPE7の各々にa[0]、a[1]、a[2]、…a[23]を格納する。この方法によれば、DPE0〜DPE7のそれぞれに格納される配列要素が同一となる。 In the multicast method, a[0], a[1], a[2], ... a[23] are stored in DPE0 in order. In the same manner, a[0], a[1], a[2], ... a[23] are stored in each of DPE1 to DPE7. With this method, the array elements stored in each of DPE0 to DPE7 are identical.
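
For comparison with the sequential method, the multicast method can be sketched as follows. This is only an illustration: the function name `multicast_layout` is hypothetical, and the order in which the copy fills the banks inside one DPE (bank R#0 to R#7 of a line, then the next line) is an assumption.

```python
# Illustrative sketch only: under the multicast method every DPE receives an
# identical copy of the array.
# Assumption: inside one DPE the copy fills banks R#0 to R#7 of a line before
# moving on to the next line.
N_DPE = 8
N_BANK = 8


def multicast_layout(array, n_dpe=N_DPE, n_bank=N_BANK):
    """Return layout[dpe][line][bank] holding the multicast copy of `array`."""
    lines = [array[k:k + n_bank] for k in range(0, len(array), n_bank)]
    # Each DPE gets its own copy of the same lines.
    return [[list(line) for line in lines] for _ in range(n_dpe)]


if __name__ == "__main__":
    a = [f"a[{i}]" for i in range(24)]
    layout = multicast_layout(a)
    print(layout[0][0])   # first line of DPE0
    print(layout[7][2])   # third line of DPE7
```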

次に、計算機１０においてWinogradアルゴリズムで畳み込み計算を行う場合のレジスタの中身について説明する。 Next, the contents of the register when the convolution calculation is performed by the Winograd algorithm in the computer 10 will be described.

図１３は、各DPEのそれぞれのレジスタG#0の中身を模式的に示す図である。 FIG. 13 is a diagram schematically showing the contents of the respective registers G # 0 of each DPE.

以下では、行列を表す記号と、その行列の各要素を格納した配列とを同じ記号で表す。例えば、t×tのbottom行列dの各要素を格納する配列をdで表し、３×３のweight行列gの各要素を格納する配列をgで表す。 In the following, the symbol representing the matrix and the array containing each element of the matrix are represented by the same symbol. For example, an array that stores each element of the t × t bottom matrix d is represented by d, and an array that stores each element of the 3 × 3 weight matrix g is represented by g.

そして、これらの配列d、gを以下の式（４）のように記述する。 Then, these arrays d and g are described by the following equation (4).

d[Cin][H][W][N], g[Cout][Cin][H’][W’] … (4)

式（４）においてNは、0〜63の値をとるバッチ数である。また、Cinは0〜255の値をとる入力チャネル番号であり、Coutは0〜383の値をとる出力チャネル数である。 In equation (4), N is the batch number, which takes a value from 0 to 63. Cin is the input channel number, which takes a value from 0 to 255, and Cout is the output channel number, which takes a value from 0 to 383.

そして、HとWは、一つのbottom行列における要素を特定する変数である。同様に、H’とW’は、一つのweight行列における要素を特定する変数である。 And H and W are variables that specify the elements in one bottom matrix. Similarly, H'and W'are variables that identify elements in a weight matrix.

この場合、配列dは、シーケンシャル方式によりDPE0〜DPE7のそれぞれのレジスタG#0に展開される。 In this case, the array d is expanded to the respective registers G # 0 of DPE0 to DPE7 by the sequential method.

配列dのような多重配列の場合は、最下位の配列要素から順にレジスタG#0に格納される。配列dの最下位の要素はバッチ数Nで特定される。よって、DPE0のバンクR#0、R#1、…R#7の順に、バッチ数Nが0、1、…7の配列要素が格納される。そして、バッチ数Nが8、9、…15の配列要素は、DPE1のバンクR#0、R#1、…R#7に順に格納される。このようにしてバッチ数Nが0〜63の各要素がDPE0〜DPE7に展開される。 In the case of a multi-dimensional array such as the array d, the elements are stored in the register G#0 in order from the lowest-order index. The lowest-order index of the array d is the batch number N. Therefore, the array elements with batch numbers N of 0, 1, ... 7 are stored in the banks R#0, R#1, ... R#7 of DPE0 in this order. The array elements with batch numbers N of 8, 9, ... 15 are then stored in order in the banks R#0, R#1, ... R#7 of DPE1. In this way, the elements with batch numbers N of 0 to 63 are expanded to DPE0 to DPE7.
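
The placement of the array d can be sketched as below. This is a hedged illustration: it assumes row-major order with N as the fastest-varying index, 64 batch numbers, and lines counted from the first stored element; the function name `d_slot` is hypothetical.

```python
# Illustrative sketch only: where element d[Cin][H][W][N] lands when array d
# is expanded by the sequential method.
# Assumptions: row-major order with N as the fastest-varying index, 64 batch
# numbers (so one line across 8 DPEs x 8 banks holds exactly one N range),
# and lines counted from the first stored element.
N_BATCH = 64
N_BANK = 8


def d_slot(cin, h, w, n, t):
    """Return (line, dpe, bank) for element d[cin][h][w][n] of a t x t tile."""
    line = (cin * t + h) * t + w    # one line per (Cin, H, W) combination
    dpe, bank = divmod(n, N_BANK)   # the batch number selects DPE and bank
    return line, dpe, bank


if __name__ == "__main__":
    print(d_slot(0, 0, 0, 0, t=4))    # batch 0 goes to DPE0, bank R#0
    print(d_slot(0, 0, 0, 8, t=4))    # batch 8 goes to DPE1, bank R#0
```

Under these assumptions each bank always holds elements of a single batch number, which is the one-to-one correspondence between banks and N discussed later in the text.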

また、配列d[Cin][H][W][N]において、Cin、H、Wで特定される上位の要素については以下のように取り扱う。 In addition, in the array d [Cin] [H] [W] [N], the higher-order elements specified by Cin, H, and W are handled as follows.

まず、図４（ａ）に示したように、bottom行列からt×tの小bottom行列dを切り出す位置を固定し、その小bottom行列dのt×t個の要素を[H][W]に格納する。そして、Cinについては、0〜255の値のうちの最初の0〜3とする。 First, as shown in FIG. 4(a), the position at which the t × t small bottom matrix d is cut out from the bottom matrix is fixed, and the t × t elements of the small bottom matrix d are stored in [H][W]. Cin is limited to the first four values, 0 to 3, of the values 0 to 255.

これによれば、Cin=0に対応するt×tの行列要素がDPE0〜DPE7のそれぞれに展開される。同様に、Cin=1、Cin=2、Cin=3のそれぞれに対応するt×tの行列要素もDPE0〜DPE7に展開される。 According to this, the matrix elements of t × t corresponding to Cin = 0 are expanded in each of DPE0 to DPE7. Similarly, the t × t matrix elements corresponding to Cin = 1, Cin = 2, and Cin = 3 are also expanded to DPE0 to DPE7.

一方、配列gについては、マルチキャスト方式によりDPE0〜DPE7のそれぞれのレジスタG#0に展開される。 On the other hand, the array g is expanded to the respective registers G # 0 of DPE0 to DPE7 by the multicast method.

この例では、Coutの値が0〜7の配列要素を、入力チャネル番号Cinごとにマルチキャストを行う。例えば、Coutの値が0〜7の配列要素のうち、Cin=0の要素をDPE0〜DPE7のそれぞれにマルチキャストする。Cin=1、Cin=2、Cin=3の配列要素についても同様にマルチキャストによりDPE0〜DPE7に転送する。 In this example, the array elements with Cout values of 0 to 7 are multicast for each input channel number Cin. For example, among the array elements with Cout values of 0 to 7, the elements with Cin = 0 are multicast to each of DPE0 to DPE7. The array elements with Cin = 1, Cin = 2, and Cin = 3 are likewise transferred to DPE0 to DPE7 by multicast.

但し、このようにマルチキャスト方式で配列gを転送すると、DPE0のバンクR#0における入力チャネル番号Cinと出力チャネルCoutの値に規則性がなくなる。これでは、そのバンクR#0に対応した計算コアC#0が、Winogradアルゴリズムで配列g、dを畳み込むのに不便である。他の計算コアC#1〜C#7や、DPE1〜DPE7においても同様である。 However, when the array g is transferred by the multicast method in this way, the values of the input channel number Cin and the output channel number Cout in the bank R#0 of DPE0 lose their regularity. This is inconvenient for the calculation core C#0 corresponding to the bank R#0 when it convolves the arrays g and d with the Winograd algorithm. The same applies to the other calculation cores C#1 to C#7 and to DPE1 to DPE7.
そこで、配列gの要素については以下のように並び替えを行う。 Therefore, the elements of the array g are rearranged as follows.

図１４は、メインメモリ１１にある配列gの配列要素を模式的に示す図である。 FIG. 14 is a diagram schematically showing the array elements of the array g in the main memory 11.

前述のように、配列gは、weight行列を表す配列であり、３×３の正方行列に対応させることができる。そこで、以下ではこの３×３の正方行列の各要素に順に0、1、…8の数字を割り当て、これらの数字で各要素を識別する。 As described above, the array g is an array representing the weight matrix and can be associated with a 3 × 3 square matrix. Therefore, in the following, the numbers 0, 1, ... 8 are assigned in order to the elements of this 3 × 3 square matrix, and each element is identified by its number.

これによれば、式（４）のようにg[Cout][Cin][H’][W’]と記述した場合、[H’]と[W’]の各々に0、1、…8の数字が割り当てられることになる。 Accordingly, when the array is written as g[Cout][Cin][H’][W’] as in equation (4), the numbers 0, 1, ... 8 are assigned to the [H’] and [W’] indices.

図１５は、前述のマルチキャスト方式で転送された直後のDPE0のレジスタG#0の中身を示す図である。 FIG. 15 is a diagram showing the contents of register G # 0 of DPE0 immediately after being transferred by the above-mentioned multicast method.

図１５に示すように、マルチキャスト方式で転送を行うと、g[Cout][Cin][H’][W’]の下位の要素から順にバンクR#0〜R#7の一つのラインが埋められる。そして、そのラインの最後のバンクR#7が埋まると、一つ上のラインが順に埋められていく。 As shown in FIG. 15, when the transfer is performed by the multicast method, one line across the banks R#0 to R#7 is filled in order from the lowest-order elements of g[Cout][Cin][H’][W’]. When the last bank R#7 of that line has been filled, the next line up is filled in the same order.

weight行列gの要素数は９であるのに対し、バンクR#0〜R#7の個数は８個であり、両者の数は一致しない。よって、このようにマルチキャスト方式でレジスタに転送を行うと、Cin=0かつCout=0の９個の要素が二つのラインをまたいでレジスタに格納されることになる。他のCinとCoutの組み合わせでも同様である。 While the number of elements in the weight matrix g is 9, the number of banks R # 0 to R # 7 is 8, and the numbers do not match. Therefore, when the transfer is performed to the register by the multicast method in this way, the nine elements of Cin = 0 and Cout = 0 are stored in the register across the two lines. The same applies to other combinations of Cin and Cout.

これにより、バンクR#0にはCinやCoutの値が様々な配列要素が格納されてしまい、バンクR#0におけるCinとCoutの規則性が低下する。 As a result, array elements with various Cin and Cout values are stored in bank R # 0, and the regularity of Cin and Cout in bank R # 0 is reduced.
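
The loss of regularity can be reproduced with a short calculation. The sketch below is an illustration under stated assumptions: the transferred slice is taken to have 8 output channels, 4 input channels, and 9 weight elements, flattened in the order g[Cout][Cin][H’][W’]; the function name `multicast_bank_contents` is hypothetical.

```python
# Illustrative sketch only: shows why bank R#0 ends up with mixed Cout and Cin
# values after array g is multicast.
# Assumptions: the transferred slice has 8 output channels, 4 input channels
# and 9 weight elements each, flattened in the order g[Cout][Cin][H'][W'].
N_COUT, N_CIN, N_ELEM, N_BANK = 8, 4, 9, 8


def multicast_bank_contents(bank):
    """Return the (cout, cin, element) triples that land in the given bank."""
    contents = []
    for flat in range(N_COUT * N_CIN * N_ELEM):
        if flat % N_BANK == bank:       # banks are filled round-robin per line
            cout, rest = divmod(flat, N_CIN * N_ELEM)
            cin, elem = divmod(rest, N_ELEM)
            contents.append((cout, cin, elem))
    return contents


if __name__ == "__main__":
    for triple in multicast_bank_contents(0)[:6]:
        print(triple)
```

Because 9 is not a multiple of 8, the ninth element of each weight matrix wraps onto the next line, so the (Cout, Cin) pair seen by a given bank drifts from line to line.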

そこで、この例では、DPE0の各計算コアC#0〜C#7が、DPE0の残りのレジスタG#1〜G#3のいずれかをバッファとしながら、レジスタG#0における配列gの要素を整列させる。 Therefore, in this example, the calculation cores C#0 to C#7 of DPE0 align the elements of the array g in the register G#0 while using one of the remaining registers G#1 to G#3 of DPE0 as a buffer.

図１６は、整列後のDPE0のレジスタG#0の中身を示す図である。 FIG. 16 is a diagram showing the contents of register G # 0 of DPE0 after alignment.

図１６に示すように、整列をすることによって、Coutの値が同一の要素は同一のバンクに格納される。例えば、バンクR#0にはCout=0の要素のみが格納される。 As shown in FIG. 16, by aligning, elements having the same Cout value are stored in the same bank. For example, bank R # 0 stores only elements with Cout = 0.
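
The target of the alignment can be sketched as follows. This is only an illustration of the resulting layout, assuming 8 output channels, 4 input channels, and 9 elements per weight matrix; the buffering through the registers G#1 to G#3 is not modelled, and the function name `aligned_banks` is hypothetical.

```python
# Illustrative sketch only: regroups the weight elements so that bank b holds
# only the elements with Cout = b, as in the aligned layout.
# Assumptions: 8 output channels, 4 input channels, 9 elements per weight
# matrix; the buffering through registers G#1 to G#3 is not modelled.
N_COUT, N_CIN, N_ELEM = 8, 4, 9


def aligned_banks():
    """Return banks[b] = list of (cout, cin, element) triples with cout == b."""
    banks = [[] for _ in range(N_COUT)]
    for cout in range(N_COUT):
        for cin in range(N_CIN):
            for elem in range(N_ELEM):
                banks[cout].append((cout, cin, elem))
    return banks


if __name__ == "__main__":
    banks = aligned_banks()
    print(len(banks[0]), banks[0][:3])
```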

図１７は、このように整列をした後のDPE0〜DPE7の各々のレジスタG#0の中身を示す図である。 FIG. 17 is a diagram showing the contents of each register G # 0 of DPE0 to DPE7 after the alignment in this way.

図１７に示すように、例えばDPE0のバンクR#0に格納される配列gの要素は、Cout=0かつCin=0〜3の要素となる。また、このバンクR#0に格納される配列dの要素は、N=0かつCin=0〜3の要素である。 As shown in FIG. 17, for example, the elements of the array g stored in the bank R # 0 of DPE0 are elements of Cout = 0 and Cin = 0 to 3. The elements of the array d stored in this bank R # 0 are elements of N = 0 and Cin = 0 to 3.

これにより、バンクR#0における配列d、gのそれぞれのCinの値が同一となり、Cinの値が同一の配列d、g同士の畳み込みを計算コアC#0がWinogradアルゴリズムに従って実行できるようになる。 As a result, the arrays d and g in the bank R#0 have the same Cin values, and the calculation core C#0 can convolve the arrays d and g having the same Cin value according to the Winograd algorithm.

また、各バンクR#0〜R#7とバッチ数Nとは一対一に対応しており、異なるバッチ数に対する畳み込み計算が各バンクR#0〜R#7において実行される。これについては他のDPE1〜DPE7でも同様である。 In addition, each bank R # 0 to R # 7 and the number of batches N have a one-to-one correspondence, and convolution calculation for different numbers of batches is executed in each bank R # 0 to R # 7. This also applies to other DPE1 to DPE7.

そして、このような畳み込み計算を各DPE0〜DPE7の各計算コアC#0〜C#7が並列実行することにより、深層学習のフォワード処理やバックワード処理を高速に実行できると期待される。 Then, it is expected that the forward processing and backward processing of deep learning can be executed at high speed by executing such convolution calculation in parallel by the calculation cores C # 0 to C # 7 of each DPE0 to DPE7.

しかし、本願発明者が検討したところ、このように各バンクR#0〜R#7とバッチ数Nとを一対一に対応させる方法には以下のような問題があることが明らかとなった。 However, as a result of the examination by the inventor of the present application, it has become clear that the method of making each bank R # 0 to R # 7 and the number of batches N have a one-to-one correspondence has the following problems.

図１８は、その問題について説明するための図であり、DPE0のレジスタG#0のバンクR#0の模式図である。 FIG. 18 is a diagram for explaining the problem, and is a schematic diagram of bank R # 0 of register G # 0 of DPE0.

この例では、各バンクR#0〜R#7とバッチ数Nとを一対一に対応させつつ、入力チャネル番号Cinが相互に等しい小bottom行列dとweight行列gとを一つのバンクに格納する。よって、一つのバンクに小bottom行列dとweight行列をそれぞれ同じ個数だけ格納する必要が生じ、小bottom行列dのサイズを大きくしようとすると小bottom行列dの要素がバンクから溢れてしまう。 In this example, while the banks R#0 to R#7 are made to correspond one-to-one with the batch numbers N, the small bottom matrices d and the weight matrices g having the same input channel number Cin are stored in one bank. Consequently, the same number of small bottom matrices d and weight matrices g must be stored in one bank, and if the size of the small bottom matrix d is increased, its elements overflow the bank.

例えば、図１８のように、４個の小bottom行列dと４個のweight行列gとをバンクR#0に格納する場合を考える。小bottom行列dのサイズはt×tであり、weight行列gのサイズは３×３である。よって、バンクR#0に格納される要素数は、４×t²＋４×３²個となる。前述のように一つのバンクに格納可能なデータの個数は１２７個であるから、これを要素数が超えないようにするにはtを４以下にする必要がある。 For example, consider the case where four small bottom matrices d and four weight matrices g are stored in the bank R#0 as shown in FIG. 18. The size of the small bottom matrix d is t × t, and the size of the weight matrix g is 3 × 3. Therefore, the number of elements stored in the bank R#0 is 4 × t² + 4 × 3². As described above, the number of data items that can be stored in one bank is 127, so keeping the number of elements within this limit requires 4 × t² + 36 ≤ 127, that is, t must be 4 or less (t = 4 gives 100 elements, whereas t = 5 gives 136).
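
The capacity check in this example can be written as a short calculation. The following is a minimal sketch using the values stated above (four matrices of each kind and a capacity of 127); the function name `max_tile_size` is hypothetical.

```python
# Illustrative sketch only: largest tile size t such that n_d small bottom
# matrices (t x t) and n_g weight matrices (3 x 3) fit into one bank.
def max_tile_size(capacity, n_d, n_g, weight_size=3):
    """Return the largest t with n_d*t*t + n_g*weight_size**2 <= capacity."""
    t = 0
    # Grow t while the next larger tile still fits in the bank.
    while n_d * (t + 1) ** 2 + n_g * weight_size ** 2 <= capacity:
        t += 1
    return t


if __name__ == "__main__":
    # Values from the example: capacity 127, four matrices of each kind.
    print(max_tile_size(127, 4, 4))
```

Storing fewer small bottom matrices per bank lowers `n_d` and therefore raises the largest admissible t, which is the direction the embodiment described below takes.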

このようにtが小さいと、式（２）で得られる小top行列yのサイズも小さくなるため、top行列を得るために多数の小top行列yを計算しなければならず、畳み込みに要する計算時間が長くなってしまう。その結果、畳み込み計算を高速化できるというWinogradアルゴリズムの特徴を十分に活かすことができなくなってしまう。 When t is small in this way, the size of the small top matrix y obtained by Eq. (2) is also small, so a large number of small top matrices y must be calculated in order to obtain the top matrix, and the calculation required for convolution. The time will be long. As a result, the feature of the Winograd algorithm, which can speed up the convolution calculation, cannot be fully utilized.

以下に、畳み込み計算を高速に実行することが可能な各実施形態について説明する。 Hereinafter, each embodiment capable of executing the convolution calculation at high speed will be described.

（本実施形態） (Present Embodiment)
図１９は、本実施形態に係る情報処理装置３１のハードウェア構成図である。 FIG. 19 is a hardware configuration diagram of the information processing device 31 according to the present embodiment.

情報処理装置３１は、計算機１０（図５参照）で実行可能なプログラムを生成するためのPC(Personal Computer)等のコンピュータであり、記憶装置３２、メインメモリ３３、プロセッサ３４、入力装置３５、及び表示装置３６を備える。これらの各部はバス３７によって相互に接続される。 The information processing device 31 is a computer such as a PC (Personal Computer) for generating a program that can be executed by the computer 10 (see FIG. 5), and includes a storage device 32, a main memory 33, a processor 34, an input device 35, and a display device 36. These components are connected to one another by a bus 37.

このうち、記憶装置３２は、例えばHDD(Hard Disk Drive)やSSD(Solid State Drive)等の二次記憶装置であり、本実施形態に係る情報処理プログラム３９を記憶する。 Of these, the storage device 32 is a secondary storage device such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), and stores the information processing program 39 according to the present embodiment.

その情報処理プログラム３９を実行することにより、後述のように計算機１０（図５参照）で実行可能なプログラムを生成することができる。 By executing the information processing program 39, a program that can be executed by the computer 10 (see FIG. 5) can be generated as described later.

なお、情報処理プログラム３９をコンピュータが読み取り可能な記録媒体３８に記録させておき、プロセッサ３４に記録媒体３８の情報処理プログラム３９を読み取らせるようにしてもよい。 The information processing program 39 may be recorded on a computer-readable recording medium 38, and the processor 34 may be made to read the information processing program 39 of the recording medium 38.

そのような記録媒体３８としては、例えばCD-ROM(Compact Disc - Read Only Memory)、DVD(Digital Versatile Disc)、及びUSB(Universal Serial Bus)メモリ等の物理的な可搬型記録媒体がある。また、フラッシュメモリ等の半導体メモリやハードディスクドライブを記録媒体３８として使用してもよい。これらの記録媒体３８は、物理的な形態を持たない搬送波のような一時的な媒体ではない。 Examples of such a recording medium 38 include a physically portable recording medium such as a CD-ROM (Compact Disc-Read Only Memory), a DVD (Digital Versatile Disc), and a USB (Universal Serial Bus) memory. Further, a semiconductor memory such as a flash memory or a hard disk drive may be used as the recording medium 38. These recording media 38 are not temporary media such as a carrier wave having no physical form.

更に、公衆回線、インターネット、及びLAN(Local Area Network)等に接続された装置に情報処理プログラム３９を記憶させておき、プロセッサ３４が情報処理プログラム３９を読み出して実行するようにしてもよい。 Further, the information processing program 39 may be stored in a device connected to a public line, the Internet, a LAN (Local Area Network), or the like, and the processor 34 may read and execute the information processing program 39.

一方、メインメモリ３３は、DRAM(Dynamic Random Access Memory)等のようにデータを一時的に記憶するハードウェアであって、その上に情報処理プログラム３９が展開される。 On the other hand, the main memory 33 is hardware that temporarily stores data, such as a DRAM (Dynamic Random Access Memory), and the information processing program 39 is loaded onto it.

プロセッサ３４は、自装置の各部を制御したり、メインメモリ３３と協働して情報処理プログラム３９を実行したりするCPU(Central Processing Unit)等のハードウェアである。 The processor 34 is hardware such as a CPU (Central Processing Unit) that controls each part of its own device and executes an information processing program 39 in cooperation with a main memory 33.

入力装置３５は、ユーザが操作するキーボードやマウス等の入力デバイスである。また、表示装置３６は、情報処理プログラム３９の実行時にユーザが使用する様々なコマンドを表示する液晶ディスプレイ等の表示デバイスである。 The input device 35 is an input device such as a keyboard or a mouse operated by the user. Further, the display device 36 is a display device such as a liquid crystal display that displays various commands used by the user when the information processing program 39 is executed.

図２０は、本実施形態に係る情報処理装置３１の機能構成図である。
図２０に示すように、情報処理装置３１は、出力部４１と算出部４２とを有する。これらの各部は、プロセッサ３４とメインメモリ３３が協働して前述の情報処理プログラム３９を実行することにより実現される。 FIG. 20 is a functional configuration diagram of the information processing device 31 according to the present embodiment.
As shown in FIG. 20, the information processing device 31 has an output unit 41 and a calculation unit 42. Each of these parts is realized by the processor 34 and the main memory 33 collaborating to execute the above-mentioned information processing program 39.

このうち、出力部４１は、計算機１０（図５参照）で実行可能なプログラム５０を生成する機能ブロックである。そのプログラムは、中間コードが記述されたファイルでもよいし、実行可能なバイナリファイルでもよい。 Of these, the output unit 41 is a functional block that generates a program 50 that can be executed by the computer 10 (see FIG. 5). The program may be a file containing intermediate code or an executable binary file.

また、算出部４２は、そのプログラム５０における様々なパラメータを最適化する機能ブロックである。そのパラメータとしては、図４（ａ）〜（ｃ）のようにbottom行列から切り出す小bottom行列dのサイズtがある。また、後述するバンクの個数qも最適化の対象となるパラメータの一例である。 Further, the calculation unit 42 is a functional block that optimizes various parameters in the program 50. As the parameter, there is a size t of a small bottom matrix d cut out from the bottom matrix as shown in FIGS. 4A to 4C. In addition, the number q of banks, which will be described later, is also an example of parameters to be optimized.

図２１は、プログラム５０を実行することにより実現される計算機１０の機能ブロック図である。 FIG. 21 is a functional block diagram of the computer 10 realized by executing the program 50.

図２１に示すように、計算機１０は、受付部５１、選択部５２、格納部５３、計算部５４、及び出力部５５を備える。これらの各部は、図５のメインメモリ１１とDLU１２が協働してプログラム５０を実行することにより実現される。 As shown in FIG. 21, the computer 10 includes a reception unit 51, a selection unit 52, a storage unit 53, a calculation unit 54, and an output unit 55. Each of these parts is realized by executing the program 50 in cooperation with the main memory 11 and the DLU 12 of FIG.

このうち、受付部５１は、bottom行列とweight行列の入力を受け付ける。また、選択部５２は、図４（ａ）〜（ｃ）に示したように、bottom行列からt×tの小bottom行列dを選択する。前述のようにサイズtの値は算出部４２によって最適化されており、最適化されたサイズtを利用して選択部５２が小bottom行列dを選択する。 Of these, the reception unit 51 accepts inputs of the bottom matrix and the weight matrix. Further, as shown in FIGS. 4A to 4C, the selection unit 52 selects a small bottom matrix d of t × t from the bottom matrix. As described above, the value of the size t is optimized by the calculation unit 42, and the selection unit 52 selects the small bottom matrix d using the optimized size t.

そして、格納部５３は、小bottom行列dとweight行列gのそれぞれの要素をDPE0〜DPE7の各バンクR#0〜R#7に格納する。 Then, the storage unit 53 stores the elements of the small bottom matrix d and the weight matrix g in the banks R # 0 to R # 7 of DPE0 to DPE7.

また、計算部５４は、各バンクR#0〜R#7に格納されたこれらの要素を用いて畳み込み計算を行う。出力部５５は、畳み込み計算の結果である小top行列y（図４（ａ）〜（ｃ）参照）を出力する。 Further, the calculation unit 54 performs a convolution calculation using these elements stored in the banks R # 0 to R # 7. The output unit 55 outputs a small top matrix y (see FIGS. 4A to 4C) which is the result of the convolution calculation.

次に、格納部５３の機能について詳細に説明する。 Next, the function of the storage unit 53 will be described in detail.
格納部５３は、メインメモリ１１から読み出した各配列の要素を各バンクR#0〜R#7に格納する機能ブロックであるが、どのように格納するかはフォワード処理とバックワード処理とで異なる。 The storage unit 53 is a functional block that stores the elements of each array read from the main memory 11 in the banks R#0 to R#7; how the elements are stored differs between forward processing and backward processing.

ここではフォワード処理について説明する。フォワード処理の場合、格納部５３は、メインメモリ１１から読み出した各配列の要素を次の式（５）のように並び替え、各要素をDPE0〜DPE7の各バンクR#0〜R#7に格納する。 Here, the forward processing will be described. In the forward processing, the storage unit 53 rearranges the elements of each array read from the main memory 11 as in the following equation (5), and stores each element in the banks R#0 to R#7 of DPE0 to DPE7.

なお、配列yは、小bottom行列dとweight行列gとを畳み込んで得られた小top行列の要素を格納するための配列である。そして、この例では、weight行列gが第１の行列の一例となり、t×tの小bottom行列dが第２の行列の一例となる。 The array y is an array for storing the elements of the small top matrix obtained by convolving the small bottom matrix d and the weight matrix g. Then, in this example, the weight matrix g is an example of the first matrix, and the small bottom matrix d of t × t is an example of the second matrix.

また、（Cinの個数）=（Cin_majorの個数）×（Cin_minorの個数）であり、組（Cin_major、Cin_minor）によって入力チャネル番号Cinを特定することができる。そこで、以下では組（Cin_major、Cin_minor）と入力チャネル番号Cinとを同一視する。例えば、Cin_major=0、Cin_minor=0の配列要素はCin=0に対応し、Cin_major=0、Cin_minor=1の配列要素はCin=1に対応する。 Further, (the number of Cin) = (the number of Cin_major) × (the number of Cin_minor), and the input channel number Cin can be identified by the pair (Cin_major, Cin_minor). Therefore, in the following, the pair (Cin_major, Cin_minor) is identified with the input channel number Cin. For example, the array element with Cin_major = 0 and Cin_minor = 0 corresponds to Cin = 0, and the array element with Cin_major = 0 and Cin_minor = 1 corresponds to Cin = 1.

同様に、（Nの個数）=（N_majorの個数）×（N_minorの個数）であり、組（N_major、N_minor）によってバッチ数Nを特定することができるため、以下では組（N_major、N_minor）とバッチ数Nとを同一視する。例えば、N_major=0、N_minor=0の配列要素はN=0に対応し、N_major=0、N_minor=1の配列要素はN=1に対応する。 Similarly, (the number of N) = (the number of N_major) × (the number of N_minor), and the batch number N can be identified by the pair (N_major, N_minor); therefore, in the following, the pair (N_major, N_minor) is identified with the batch number N. For example, the array element with N_major = 0 and N_minor = 0 corresponds to N = 0, and the array element with N_major = 0 and N_minor = 1 corresponds to N = 1.

式（５）によれば、入力チャネル番号Cinとバッチ数Nとを特定することで一つの小bottom行列dを特定することができる。この例における入力チャネル番号Cinは、このように小bottom行列dを特定する第１の識別子の一例である。同様に、この例におけるバッチ数Nは、小bottom行列dを特定する第２の識別子の一例である。 According to the equation (5), one small bottom matrix d can be specified by specifying the input channel number Cin and the number of batches N. The input channel number Cin in this example is an example of the first identifier that identifies the small bottom matrix d in this way. Similarly, the number of batches N in this example is an example of a second identifier that identifies the small bottom matrix d.

また、この例では、Cin_minorの総数を4とし、N_minorの総数を16とする。更に、Cin_majorの総数は1とし、N_majorの総数は4とする。これにより、図２のように全部で256個ある入力チャネル番号Cinのうちの4(=1×4)個と、64(=4×16)個のバッチ数の各々で特定されるbottom行列に対して畳み込み計算が行われる。 In this example, the total number of Cin_minor is 4, and the total number of N_minor is 16. Furthermore, the total number of Cin_major is 1, and the total number of N_major is 4. As a result, the convolution calculation is performed on the bottom matrices identified by 4 (= 1 × 4) of the 256 input channel numbers Cin shown in FIG. 2 and by each of the 64 (= 4 × 16) batch numbers.
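
The identification of an index with its (major, minor) pair can be sketched as below. This is an illustration only: it assumes that the minor part is the fastest-varying one, which is consistent with the examples (0, 0) → 0 and (0, 1) → 1 given above; the function names are hypothetical.

```python
# Illustrative sketch only: converts between an index and its (major, minor)
# pair. Assumption: the minor part is the fastest-varying one, which matches
# the examples (major, minor) = (0, 0) -> 0 and (0, 1) -> 1.
N_CIN_MINOR = 4   # total number of Cin_minor in the example
N_N_MINOR = 16    # total number of N_minor in the example


def split_index(index, n_minor):
    """Return (major, minor) for a flat index."""
    return divmod(index, n_minor)


def join_index(major, minor, n_minor):
    """Return the flat index identified by the pair (major, minor)."""
    return major * n_minor + minor


if __name__ == "__main__":
    print(split_index(1, N_CIN_MINOR))     # Cin = 1
    print(join_index(3, 15, N_N_MINOR))    # last batch number in the example
```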

更に、配列dにおける要素[H][W]は、t×tの小bottom行列dの各要素に対応する。 Furthermore, the elements [H] [W] in the array d correspond to each element of the small bottom matrix d of t × t.

一方、配列gの要素[H’][W’]は、３×３のweight行列gの各要素に対応する。また、配列gの入力チャネル番号Cinの総数は、配列dの入力チャネル番号と同じ4個とする。そして、出力チャネル番号Coutの総数は8個とする。 On the other hand, the elements [H'] [W'] of the array g correspond to each element of the weight matrix g of 3 × 3. The total number of input channel numbers Cin in the array g is four, which is the same as the input channel numbers in the array d. The total number of output channel number Couts is eight.

図２２は、フォワード処理を行う場合に、格納部５３によって各配列d、gが格納されたDPE0〜DPE7の各々のレジスタG#0の中身を示す図である。 FIG. 22 is a diagram showing the contents of each register G # 0 of DPE0 to DPE7 in which the arrays d and g are stored by the storage unit 53 when the forward processing is performed.

DPE0においては、複数の計算コアの各々が、バンクR#0〜R#7のうちで自身に対応するバンクに格納された各行列d、gの間で畳み込み計算を行う。その畳み込み計算は、複数の計算コアの各々で並列実行されるため、畳み込み計算を高速化することができる。これについてはDPE1〜DPE7においても同様である。 In DPE0, each of the plurality of calculation cores performs the convolution calculation between the matrices d and g stored in its own corresponding bank among the banks R#0 to R#7. Since the convolution calculation is executed in parallel by the plurality of calculation cores, it can be sped up. The same applies to DPE1 to DPE7.

また、配列d、gのうち、配列dについては、図１３と同様にシーケンシャル方式でDPE0〜DPE7の各バンクR#0〜R#7に格納する。ここでは、Cin_majorが同一の配列dのみを各バンクR#0〜R#7に一度に格納する。そして、その配列dの畳み込みが終了した後に、Cin_majorが別の値の配列dを各バンクR#0〜R#7に格納する。図２２は、Cin_major=0の配列dを各バンクR#0〜R#7に格納した場合を想定している。 Of the arrays d and g, the array d is stored in the banks R#0 to R#7 of DPE0 to DPE7 by the sequential method, as in FIG. 13. Here, only the arrays d having the same Cin_major are stored in the banks R#0 to R#7 at one time. After the convolution of those arrays d is completed, the arrays d having a different Cin_major value are stored in the banks R#0 to R#7. FIG. 22 assumes the case where the arrays d with Cin_major = 0 are stored in the banks R#0 to R#7.

このとき、本実施形態では、式（５）のように配列dの最下位にCin_minorを記述し、その上位にN_minorを記述したため、N_minorが同一の範囲で各バンクとCin_minorとが一対一に対応する。そのため、Cin_minorの総数をq(=4)個とすると、一つのDPEにおけるq個のバンクの各々には、入力チャネル番号（Cin_major、Cin_minor）が相互に異なり、かつバッチ数（N_major、N_minor）が同一のq個の小bottom行列dが格納されることになる。 At this time, in the present embodiment, Cin_minor is written as the lowest-order index of the array d and N_minor as the index immediately above it, as in equation (5), so the banks and Cin_minor correspond one-to-one within a range where N_minor is the same. Therefore, if the total number of Cin_minor is q (= 4), the q banks in one DPE store q small bottom matrices d whose input channel numbers (Cin_major, Cin_minor) differ from one another and whose batch numbers (N_major, N_minor) are the same.

例えば、DPE0においては、R#0〜R#3の４(=q)個のバンクの各々に、バッチ数Nが（0、0）であり、かつ入力チャネル番号Cinが（0、0）、（0、1）、（0、2）、（0、3）である４個の小bottom行列dが格納される。 For example, in DPE0, the number of batches N is (0,0) and the input channel number Cin is (0,0) in each of the 4 (= q) banks of R # 0 to R # 3. Four small bottom matrices d, which are (0, 1), (0, 2), and (0, 3), are stored.

これにより、図１３のようにバンクR#0〜R#7ごとにバッチ数Nを変える例とは異なり、同一のバッチ数Nを有するq個の小bottom行列dの畳み込み計算をq個の計算コアが並列して実行することができる。 As a result, unlike the example in which the batch number N is changed for each bank R # 0 to R # 7 as shown in FIG. 13, the convolution calculation of q small bottom matrices d having the same batch number N is calculated as q. Cores can run in parallel.
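
The bank assignment of the forward layout can be sketched as follows. This is a hedged illustration: it assumes that Cin_minor is the fastest-varying index and N_minor the next, that q = 4, that each DPE has 8 banks, and that the sequential method is used; the function name `forward_slot` is hypothetical.

```python
# Illustrative sketch only: DPE and bank that receive the small bottom matrix
# identified by (N_minor, Cin_minor) in the forward layout.
# Assumptions: Cin_minor is the fastest-varying index and N_minor the next,
# q = 4 values of Cin_minor, 8 banks per DPE, sequential expansion.
Q = 4
N_BANK = 8


def forward_slot(n_minor, cin_minor, q=Q, n_bank=N_BANK):
    """Return (dpe, bank) for the matrix with the given minor indices."""
    flat = n_minor * q + cin_minor   # position within one line
    return divmod(flat, n_bank)


if __name__ == "__main__":
    # Banks R#0 to R#3 of DPE0 share N_minor = 0 and differ in Cin_minor.
    print([forward_slot(0, c) for c in range(Q)])
    # Banks R#4 to R#7 of DPE0 then hold N_minor = 1.
    print([forward_slot(1, c) for c in range(Q)])
```

With 16 values of N_minor and q = 4, the 64 (N_minor, Cin_minor) pairs fill exactly one line across the 8 DPEs.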

一方、weight行列gについては、格納部５３が図１３の例と同様にマルチキャスト方式によりメインメモリ１１から各DPE0〜DPE7の各バンクに格納する。 On the other hand, the storage unit 53 stores the weight matrix g from the main memory 11 into each bank of DPE0 to DPE7 by the multicast method, as in the example of FIG. 13.

ここでは、格納部５３は、小bottom行列dと同一の入力チャネル番号Cinを有するweight行列gを各DPE0〜DPE7の各バンクに格納する。このように入力チャネル番号Cinが相互に等しい行列d、gを同一のバンクに格納することにより、計算部５４が、図２のように入力チャネル番号Cinが相互に等しい行列d、g同士の畳み込み計算を行うことができる。 Here, the storage unit 53 stores, in each bank of DPE0 to DPE7, the weight matrix g having the same input channel number Cin as the small bottom matrix d. By storing the matrices d and g having the same input channel number Cin in the same bank in this way, the calculation unit 54 can perform the convolution calculation between the matrices d and g having the same input channel number Cin, as in FIG. 2.

但し、マルチキャスト方式で配列gを各バンクに転送すると、図１５を参照して説明したように、一つのバンクにおける入力チャネル番号Cinと出力チャネルCoutの規則性が低下する。そこで、本実施形態では、Winogradアルゴリズムで畳み込み計算を行うときに、計算部５４が以下のようにして配列gの要素を整列させる。 However, when the array g is transferred to each bank by the multicast method, the regularity of the input channel number Cin and the output channel Cout in one bank is reduced as described with reference to FIG. Therefore, in the present embodiment, when the convolution calculation is performed by the Winograd algorithm, the calculation unit 54 aligns the elements of the array g as follows.

図２３〜図２５は、計算部５４がWinogradアルゴリズムで畳み込み計算を行う場合のDPE0の各レジスタG#0〜G#3の中身を示す図である。なお、図２３〜図２５では、図が煩雑になるのを避けるために、レジスタG#0〜G#3のバンクR#0のみを示している。 23 to 25 are diagrams showing the contents of each register G # 0 to G # 3 of DPE0 when the calculation unit 54 performs the convolution calculation by the Winograd algorithm. Note that, in FIGS. 23 to 25, only the bank R # 0 of the registers G # 0 to G # 3 is shown in order to avoid complicating the figure.

畳み込み計算を行う前は、図２３（ａ）に示すように、レジスタG#0のバンクR#0に配列d、gの各要素が格納されている。このうち、配列dとしては、前述のようにバッチ数N（＝（N_major、N_minor））が異なる複数の配列dがバンクR#0に格納されている。 Before the convolution calculation is performed, as shown in FIG. 23A, the elements of the arrays d and g are stored in the bank R # 0 of the register G # 0. Of these, as the array d, as described above, a plurality of arrays d having different batch numbers N (= (N _major , N _minor )) are stored in bank R # 0.

次に、式（２）に従って、配列dの両側から行列B^T、Bを乗算し、その結果である行列B^TdBを配列dと同じラインに格納する。なお、行列B^T、Bの各要素は、バンクR#0の定数領域cstに格納されている。 Next, according to equation (2), the array d is multiplied from both sides by the matrices B^T and B, and the resulting matrix B^TdB is stored in the same line as the array d. The elements of the matrices B^T and B are stored in the constant area cst of the bank R#0.

また、この段階では、weight行列を表す配列gは、図１５のように規則性が乱れた状態となっている。 Further, at this stage, the array g representing the weight matrix is in a state in which the regularity is disturbed as shown in FIG.

そこで、次のステップでは、図２３（ｂ）に示すように、レジスタG#0のバンクR#0に格納されている配列gの各要素を、レジスタG#3のバンクR#0に転送しながら各要素を整列させる。 Therefore, in the next step, as shown in FIG. 23B, each element of the array g stored in the bank R # 0 of the register G # 0 is transferred to the bank R # 0 of the register G # 3. While aligning each element.

整列後のレジスタの中身は、図１６に示したように、バンクR#0〜R#7と出力チャネル番号Coutとが一対一に対応しており、バンクR#0にはCout=0の要素のみが格納される。 As shown in FIG. 16, the contents of the aligned register have a one-to-one correspondence between banks R # 0 to R # 7 and the output channel number Cout, and bank R # 0 has an element of Cout = 0. Only stored.

次に、図２４に示すように、式（２）に従って、配列gの両側から行列G、G^Tを乗算し、その結果である行列GgG^Tを同じバンクの空き領域に格納する。なお、行列G、G^Tの各要素は、バンクR#0の定数領域cstに格納されている。 Next, as shown in FIG. 24, according to equation (2), the array g is multiplied from both sides by the matrices G and G^T, and the resulting matrix GgG^T is stored in a free area of the same bank. The elements of the matrices G and G^T are stored in the constant area cst of the bank R#0.

次いで、図２５に示すように、レジスタG#0のバンクR#0にある二つの行列B^TdBと、レジスタG#3のバンクR#0にある一つの行列GgG^Tとに対して、式（２）の要素ごとの乗算「◎」を行う。 Next, as shown in FIG. 25, the element-wise multiplication "◎" of equation (2) is performed on the two matrices B^TdB in the bank R#0 of the register G#0 and the one matrix GgG^T in the bank R#0 of the register G#3.

なお、畳み込み計算は、図２を参照して説明したように入力チャネル番号Cinが同じ二つの行列に対して行う。よって、レジスタG#3のバンクR#0にある四つの行列GgG^TのうちのCin=0の行列と、レジスタG#0のバンクR#0にあるCin_minor=0の二つの行列B^TdBを用いて要素ごとの乗算「◎」を行う。 Note that the convolution calculation is performed on two matrices having the same input channel number Cin, as described with reference to FIG. 2. Therefore, the element-wise multiplication "◎" is performed using the matrix with Cin = 0 among the four matrices GgG^T in the bank R#0 of the register G#3 and the two matrices B^TdB with Cin_minor = 0 in the bank R#0 of the register G#0.

この後は、式（２）に従って[GgG^T]◎[B^TdB]の両側から行列A^T、Aを乗算し、小top行列yを得る。 After this, according to equation (2), [GgG^T]◎[B^TdB] is multiplied from both sides by the matrices A^T and A to obtain the small top matrix y.

以上により、計算部５４が行うWinogradアルゴリズムを用いた畳み込み計算を終了する。 As described above, the convolution calculation using the Winograd algorithm performed by the calculation unit 54 is completed.
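
The sequence of steps above (B^TdB, GgG^T, the element-wise product, and the final multiplication by A^T and A) can be checked numerically for a single tile. The following is a minimal sketch, not the program 50 itself: it assumes t = 4 and the widely used F(2×2, 3×3) transform matrices, which are not given in this section, and all function names are hypothetical.

```python
# Illustrative sketch only: Winograd convolution of one 4 x 4 tile d with a
# 3 x 3 weight matrix g, y = A^T [ (G g G^T) * (B^T d B) ] A, where * is the
# element-wise product.
# Assumption: the widely used F(2x2, 3x3) transform matrices are used here;
# the description itself does not fix t to 4.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(x, y):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def winograd_tile(d, g):
    """Return the 2 x 2 small top matrix y for tile d and weights g."""
    u = matmul(matmul(G, g), transpose(G))       # G g G^T   (4 x 4)
    v = matmul(matmul(BT, d), transpose(BT))     # B^T d B   (4 x 4)
    m = [[a * b for a, b in zip(ru, rv)] for ru, rv in zip(u, v)]
    return matmul(matmul(AT, m), transpose(AT))  # A^T m A   (2 x 2)


def direct_tile(d, g):
    """Reference result: sliding-window sum of products without Winograd."""
    return [[sum(d[i + a][j + b] * g[a][b]
                 for a in range(3) for b in range(3))
             for j in range(2)] for i in range(2)]


if __name__ == "__main__":
    d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    print(winograd_tile(d, g) == direct_tile(d, g))
```

The element-wise product needs 16 multiplications per tile, whereas the direct calculation of the same 2 × 2 output needs 36, which is the source of the reduction in calculation amount.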

このような畳み込み計算によれば、図２３（ａ）に示したように、レジスタG#0のバンクR#0に、バッチ数N（=（N_major、N_minor））が異なるbottom行列を格納する。 In this convolution calculation, as shown in FIG. 23(a), bottom matrices having different batch numbers N (= (N_major, N_minor)) are stored in the bank R#0 of the register G#0.

これにより、図１７のようにバッチ数Nが同一で入力チャネル番号Cinが異なる複数の小bottom行列dを同一のバンクに格納する例と比較して、一つのバンクに格納する小bottom行列dの個数を減らすことができる。その結果、bottom行列dのサイズtを大きくすることができ、Winogradアルゴリズムで畳み込み計算を高速に行うことが可能となる。 As a result, as compared with the example of storing a plurality of small bottom matrices d having the same number of batches N but different input channel numbers Cin in the same bank as shown in FIG. 17, the small bottom matrix d stored in one bank The number can be reduced. As a result, the size t of the bottom matrix d can be increased, and the convolution calculation can be performed at high speed by the Winograd algorithm.

t=6の場合について本願発明者が試算したところ、Winogradアルゴリズムを使用しない図３（ａ）〜（ｃ）の例では、畳み込みに要する計算時間が２３０４サイクルとなった。一方、本実施形態ではその計算時間は１２６４サイクルとなり、１．８２（＝２３０４／１２６４）倍の高速化が図られることが明らかとなった。 As a result of a trial calculation by the inventor of the present application for the case of t = 6, in the examples of FIGS. 3A to 3C which do not use the Winograd algorithm, the calculation time required for convolution was 2304 cycles. On the other hand, in the present embodiment, the calculation time is 1264 cycles, and it has been clarified that the speed can be increased by 1.82 (= 2304/1264) times.

畳み込み計算を更に高速に行うにはtの値をなるべく大きくすればよいが、tを大きくし過ぎるとバンクR#0〜R#7の各々に小bottom行列dを格納することができなくなってしまう。一方、tの値が小さいとバンクR#0〜R#7の各々に小bottom行列dを確実に格納できるものの、畳み込み計算の計算時間が長くなってしまう。 To perform the convolution calculation even faster, the value of t should be made as large as possible, but if t is made too large, it will not be possible to store the small bottom matrix d in each of banks R # 0 to R # 7. .. On the other hand, if the value of t is small, the small bottom matrix d can be reliably stored in each of the banks R # 0 to R # 7, but the calculation time of the convolution calculation becomes long.

Therefore, in the present embodiment, the optimum value of t is obtained as follows.
First, the parameters are defined as follows.
p: number of banks in one DPE
q: number of banks in one DPE in which small bottom matrices d having the same N_minor are stored
R: number of data items that can be stored in one bank

In the example of FIG. 22, these parameters take the following values.

p: 8
q: 4
R: 128

In addition, the following parameters are defined.
Cin': number of input channel numbers Cin processed at one time by DPE0
Cout': number of output channel numbers Cout processed at one time by DPE0
N': number of batch numbers N processed at one time by DPE0
These parameters are described below with reference to the example of FIG. 22.

Cin' is, as stated above, the number of input channel numbers Cin processed at one time by DPE0. The input channel number Cin is specified by the pair (Cin_major, Cin_minor). In the example of FIG. 22, DPE0 processes only the arrays g and d with (Cin_major, Cin_minor) = (0, 0), (0, 1), (0, 2), (0, 3), so Cin' = 4.

Cout' is, as stated above, the number of output channel numbers Cout processed at one time by DPE0. In the example of FIG. 22, eight weight matrices g with Cout values 0 to 7 are stored in DPE0, so Cout' = 8.

N' is, as stated above, the number of batch numbers N processed at one time by DPE0. In the example of FIG. 22, DPE0 processes four small bottom matrices d whose pairs (N_major, N_minor) are (0, 0), (0, 1), (1, 0), (1, 1), so N' = 4.
Next, the calculation time of the convolution calculation is examined.

First, consider the calculation time required to obtain the matrix B^TdB from a t×t small bottom matrix d as shown in FIG. 23(a). To obtain B^TdB, for example, B^Td is calculated first and the result is then multiplied from the right by the matrix B. To calculate B^Td, the t×t small bottom matrix d is decomposed into t column vectors, and the product of each column vector and the matrix B^T is obtained.

In this example, the calculation time required to obtain the product of the matrix B^T and one of the t column vectors of the t×t small bottom matrix d is written as b(t). Using the function b(t), the calculation time required for one DPE to obtain B^TdB can be written as the following equation (6).

The factor "t" appears in equation (6) because obtaining B^Td requires multiplying the matrix B^T by each of the t column vectors of the small bottom matrix d, which takes t times the calculation time represented by b(t). Similarly, obtaining the product of the matrix B^Td and B requires multiplying the matrix B^Td by each of the t column vectors of the matrix B. The total calculation time is therefore (t + t) times the calculation time represented by b(t), and the factor "t + t" is included in equation (6).

As shown in FIG. 22, one DPE holds Cin'·N' small bottom matrices d in total, so the number of small bottom matrices d per bank is Cin'·N'/q. Each of the calculation cores C#0 to C#7 must obtain B^TdB for each of the Cin'·N'/q small bottom matrices d in its own bank, so the factor Cin'·N'/q is included in equation (6).
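The image of equation (6) is not reproduced in this text. Collecting the factors explained above (the factor t + t, the per-column time b(t), and the Cin'·N'/q small bottom matrices per bank) suggests the following form. This is a reconstruction from the description, not a copy of the original equation.

```latex
(t + t)\, b(t)\, \frac{C_{in}'\, N'}{q} \tag{6}
```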

Next, consider the calculation time required to obtain the matrix GgG^T from a 3×3 weight matrix g as shown in FIG. 24.

To obtain GgG^T, for example, Gg is calculated first and the result is then multiplied from the right by the matrix G^T. To calculate Gg, the weight matrix g is decomposed into three column vectors, and the product of each column vector and the matrix G is obtained.

In this example, the calculation time required to obtain the product of the matrix G and one of the three column vectors of the 3×3 weight matrix g is written as w(t). Using the function w(t), the calculation time required for one DPE to obtain GgG^T can be written as the following equation (7).

The factor "3" appears in equation (7) because obtaining Gg requires multiplying the matrix G by each of the three column vectors of the weight matrix g, which takes three times the calculation time represented by w(t).

Obtaining the product of the matrix Gg and the matrix G^T requires multiplying the matrix Gg by each of the t column vectors of the matrix G^T. The total calculation time is therefore (t + 3) times the calculation time represented by w(t), and the factor "t + 3" is included in equation (7).

As shown in FIG. 22, one DPE holds Cin'·Cout' weight matrices g in total, so the number of weight matrices g per bank is Cin'·Cout'/p. Each of the calculation cores C#0 to C#7 must obtain GgG^T for each of the Cin'·Cout'/p weight matrices g in its own bank, so the factor Cin'·Cout'/p is included in equation (7).
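The image of equation (7) is likewise not reproduced in this text. The factors explained above (the factor t + 3, the per-column time w(t), and the Cin'·Cout'/p weight matrices per bank) suggest the following form, given here as a reconstruction from the description only.

```latex
(t + 3)\, w(t)\, \frac{C_{in}'\, C_{out}'}{p} \tag{7}
```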

Next, consider the calculation time required for the element-wise multiplication of the matrices B^TdB and GgG^T as shown in FIG. 25.

As shown in FIG. 22, the number of small bottom matrices d stored in one DPE is N'·Cin'·Cout'/p, and the number of elements of a small bottom matrix d is t². The number of multiplications performed when the matrices B^TdB and GgG^T are multiplied element by element is therefore expressed by the following equation (8).
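The image of equation (8) is not reproduced in this text. Taking the count N'·Cin'·Cout'/p and the t² elements per matrix stated above at face value, the number of multiplications would be their product. This is a reconstruction and should be checked against the original drawing of the equation.

```latex
t^2 \cdot \frac{N'\, C_{in}'\, C_{out}'}{p} \tag{8}
```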

Equations (6) to (8) give the calculation time when N' of the N batch numbers, Cout' of the Cout output channel numbers, and Cin' of the Cin input channel numbers are selected. To perform the convolution calculation for all the bottom matrices and weight matrices of FIG. 2, the calculation must therefore be repeated the number of times given by the following equation (9).

The factor HW/(t-2)² in equation (9) represents the total number of ways in which a t×t submatrix is cut out of the H×W bottom matrix.
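The factor HW/(t-2)² can be read as an estimate of the number of tiles: with a 3×3 weight matrix, each t×t tile yields (t-2)×(t-2) outputs, so neighbouring tiles are cut out at a stride of t-2. The sketch below compares that estimate with an exact count. It assumes an unpadded input and a unit-stride 3×3 convolution, which the text does not state, and the function names are hypothetical.

```python
import math


def tile_count_estimate(h, w, t):
    """The factor H*W / (t-2)^2 that appears in equation (9)."""
    return h * w / (t - 2) ** 2


def tile_count_exact(h, w, t):
    """Tiles of size t x t, cut at stride t-2, needed to cover the
    (h-2) x (w-2) outputs of an unpadded 3x3 convolution (assumption)."""
    step = t - 2
    return math.ceil((h - 2) / step) * math.ceil((w - 2) / step)


print(tile_count_estimate(18, 18, 6), tile_count_exact(18, 18, 6))
```

For large H and W the two values approach each other, which is presumably why the simpler factor is used in the calculation-time model.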

According to equations (6) to (9), the calculation time depends on q as well as on t. In the present embodiment, therefore, the calculation time for the convolution calculation in one DPE is represented by a first function f(t, q). The first function f(t, q) can be expressed as the following equation (10) by multiplying the sum of equations (6) to (8) by equation (9).

To shorten the calculation time required for the convolution, it suffices to find the combination of t and q that minimizes the first function f(t, q) under the condition that the number of elements of the weight matrices g and the small bottom matrices d does not exceed the number that can be stored in the register.

The numbers of elements of the small bottom matrices d and the weight matrices g are therefore examined next.
First, the number of elements of the small bottom matrices d is described.

The number of elements E_b of the small bottom matrices d in one bank of one DPE can be expressed by the following equation (11).

In equation (11), t² is the number of elements of one small bottom matrix d, and Cin'·N'/q is the number of small bottom matrices d stored in one bank.

The number of elements E_w of the weight matrices g in one bank of one DPE can be expressed by the following equation (12).

In equation (12), 3² is the number of elements of one weight matrix g, and Cin'·Cout'/p is the number of weight matrices g stored in one bank.

From equations (11) and (12), a second function g(t, q), which represents the total number of elements of the small bottom matrices d and the weight matrices g, can be written as the following equation (13).

Since the number of data items that can be stored in one bank is R, as stated above, the constraint of the following equation (14) is obtained.
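The images of equations (11) to (14) are not reproduced in this text. The factors named in the description (t² elements per small bottom matrix, Cin'·N'/q such matrices per bank, 3² elements per weight matrix, Cin'·Cout'/p such matrices per bank, and the bank capacity R) suggest the following forms. They are reconstructions, not copies of the original equations.

```latex
\begin{align}
E_b &= t^2 \cdot \frac{C_{in}'\, N'}{q} \tag{11} \\
E_w &= 3^2 \cdot \frac{C_{in}'\, C_{out}'}{p} \tag{12} \\
g(t, q) &= E_b + E_w \tag{13} \\
g(t, q) &\le R \tag{14}
\end{align}
```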

Accordingly, the convolution calculation can be sped up by finding, among the combinations of t and q that satisfy the constraint of equation (14), the combination that minimizes the first function f(t, q) of equation (10).

In the present embodiment, therefore, the calculation unit 42 calculates, among the combinations of t and q that satisfy the constraint of equation (14), the combination that minimizes the first function f(t, q) of equation (10).

In the present embodiment, R = 128, and the number of candidate pairs of t and q that satisfy equation (14) is not large. The calculation unit 42 can therefore find all combinations of t and q that satisfy equation (14) by exhaustive search and identify the one that minimizes the first function f(t, q) of equation (10).
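The exhaustive search described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the capacity check uses the form of equations (11) to (14) inferred from the description, the cost function is supplied by the caller, and all function names are hypothetical. The example cost covers only t = 6, because b(6) = 15 and w(6) = 8 are the only step counts given in the text. It omits the repetition count of equation (9), which is assumed not to depend on q for a fixed t. The values Cin' = 2 and N' = 4 are illustrative and are not the values of FIG. 22.

```python
def elements_per_bank(t, q, cin_p, cout_p, n_p, p):
    """Second function g(t, q): elements held in one bank (inferred form)."""
    e_b = t * t * cin_p * n_p / q     # small bottom matrices, cf. equation (11)
    e_w = 3 * 3 * cin_p * cout_p / p  # weight matrices, cf. equation (12)
    return e_b + e_w


def search_tq(ts, qs, cost, R, cin_p, cout_p, n_p, p):
    """Try every (t, q); return the feasible pair with the smallest cost."""
    best = None
    for t in ts:
        for q in qs:
            if elements_per_bank(t, q, cin_p, cout_p, n_p, p) > R:
                continue  # does not fit in a bank, cf. equation (14)
            c = cost(t, q)
            if best is None or c < best[0]:
                best = (c, t, q)
    return None if best is None else (best[1], best[2])


B6, W6 = 15, 8  # b(6) and w(6) as stated in the text


def cost_t6(t, q, cin_p=2, cout_p=8, n_p=4, p=8):
    """Per-tile cost for t = 6 only (sum of the inferred equations (6)-(8))."""
    assert t == 6, "step counts are only known for t = 6"
    bottom = (t + t) * B6 * cin_p * n_p / q     # B^T d B
    weight = (3 + t) * W6 * cin_p * cout_p / p  # G g G^T
    mult = t * t * n_p * cin_p * cout_p / p     # element-wise products
    return bottom + weight + mult


best = search_tq([6], [1, 2, 4, 8], cost_t6,
                 R=128, cin_p=2, cout_p=8, n_p=4, p=8)
print(best)
```

Passing the cost as a function keeps the search independent of how b(t) and w(t) are tabulated for other values of t.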

In equation (10), b(t) and w(t) were treated as known functions. They can be obtained as follows.

First, how to obtain w(t) is described. As stated above, w(t) is the calculation time required, when Gg is calculated, to obtain the product of the matrix G and one of the three column vectors of the 3×3 weight matrix g. When t = 6, the elements of the matrix G are given by the following equation (15).

This matrix G can be rewritten as in the following equation (16).

The two matrices on the right-hand side of equation (16) are denoted as in the following equations (17) and (18).

To calculate Gg, therefore, G'g is calculated first and the result is then multiplied from the left by G''. The method of calculating G'g is described below.

In the following, one column g' of the 3×3 weight matrix g is written as (g₀, g₁, g₂)^T. Then G'g' can be written as the following equation (19).

Here, (x₀, x₁, x₂, x₃, x₄, x₅)^T are variables that store the elements of G'g'.

To perform the calculation of equation (19), six array elements a[0], a[1], a[2], a[3], a[4], and a[5] are prepared, and g₀, g₁, and g₂ are stored in a[0], a[1], and a[2], respectively. Two array elements b[0] and b[1] are prepared as calculation buffers.

The calculation of equation (19) can then be realized by assigning values to the array elements in the order shown in FIG. 26.

FIG. 26 is a schematic diagram showing the calculation of equation (19) step by step. In FIG. 26, "//" introduces a comment that explains the meaning of each step. The same applies to FIG. 27 described later.

When the calculation is performed by the procedure shown in FIG. 26, the result is finally (a[0], a[1], a[2], a[3], a[4], a[5]) = (x₀, x₁, x₅, x₂, x₄, x₃), so that the calculation result of G'g' is stored in the array elements a[0], a[1], a[2], a[3], a[4], and a[5].

The calculation of G'g' can thus be performed in eight steps, so w(6) = 8. When the value of t is other than 6, the value of w(t) can be obtained in the same way.
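The idea of reusing intermediate values when transforming one column of g can be sketched as follows. Equations (15) to (19) and FIG. 26 are not reproduced in this text, so this sketch assumes the standard 6×3 Winograd matrix G for t = 6 and its own grouping of terms. It is not the sequence of FIG. 26, it returns the elements in natural order rather than the storage order described above, and its operation count is not claimed to equal the eight steps stated in the text. The names `transform_weight_column` and `matvec` are hypothetical.

```python
from fractions import Fraction as F

# Assumption: standard 6x3 Winograd matrix G for t = 6 (not from the patent).
G = [[F(1, 4), 0, 0],
     [F(-1, 6), F(-1, 6), F(-1, 6)],
     [F(-1, 6), F(1, 6), F(-1, 6)],
     [F(1, 24), F(1, 12), F(1, 6)],
     [F(1, 24), F(-1, 12), F(1, 6)],
     [0, 0, 1]]


def matvec(m, v):
    """Reference: full matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transform_weight_column(g0, g1, g2):
    """Compute G (g0, g1, g2)^T while sharing partial sums between rows."""
    s = F(g0 + g2) / 6         # shared by rows 1 and 2
    h = F(g1) / 6
    u = F(g0) / 24 + F(g2) / 6  # shared by rows 3 and 4
    v = F(g1) / 12
    return [F(g0) / 4, -s - h, -s + h, u + v, u - v, F(g2)]


print(transform_weight_column(1, 2, 3) == matvec(G, [1, 2, 3]))
```

Rows 1 and 2 of G differ only in the sign of the g₁ term, as do rows 3 and 4, which is what allows the partial sums to be shared.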

Next, how to obtain b(t) is described. As stated above, b(t) is the calculation time required to obtain the product B^Td of the matrix B^T and one of the t column vectors of the t×t small bottom matrix d. When t = 6, the elements of the matrix B^T are given by the following equation (20).

In the following, one column d' of the 6×6 small bottom matrix d is written as (d₀, d₁, d₂, d₃, d₄, d₅)^T. Then B^Td' can be written as the following equation (21).

Here, (x₀, x₁, x₂, x₃, x₄, x₅)^T are variables that store the elements of B^Td'.

To perform the calculation of equation (21), six array elements a[0], a[1], a[2], a[3], a[4], and a[5] are prepared, and d₀, d₁, d₂, d₃, d₄, and d₅ are stored in them in advance.

Four array elements b[0], b[1], b[2], and b[3] are prepared as calculation buffers.

The calculation of equation (21) can then be realized by assigning values to the array elements in the order shown in FIG. 27.

FIG. 27 is a schematic diagram showing the calculation of equation (21) step by step.
When the calculation is performed by the procedure shown in FIG. 27, the result is finally (a[0], a[1], a[2], a[3], a[4], a[5]) = (x₀, x₁, x₂, x₃, x₄, x₅), so that the calculation result of B^Td' is stored in the array elements a[0], a[1], a[2], a[3], a[4], and a[5].

The calculation of B^Td' can thus be performed in 15 steps, so b(6) = 15. When the value of t is other than 6, the value of b(t) can be obtained in the same way.

Based on the matters described above, the information processing device 31 according to the present embodiment executes the following information processing method.

FIG. 28 is a flowchart of the information processing method according to the present embodiment.
First, in step S1, the calculation unit 42 (see FIG. 20) calculates a combination of t and q. For example, the calculation unit 42 calculates, among the combinations of t and q that satisfy the constraint of equation (14), the combination that minimizes the first function f(t, q) of equation (10). This yields, among the combinations of t and q for which the elements of the weight matrices g and the t×t small bottom matrices d can be stored in the q banks, the combination with the shortest calculation time.

Next, in step S2, the output unit 41 (see FIG. 20) outputs the program 50, which is executable by the computer 10 (see FIG. 5).

The program 50 uses the combination of t and q calculated in step S1. For example, when the program 50 is executed on the computer 10, the selection unit 52 (see FIG. 21) selects a t×t small bottom matrix d from the bottom matrix.

The storage unit 53 then stores the t×t small bottom matrices d and the weight matrices g in each of q banks among banks R#0 to R#7 of DPE0. After that, the calculation unit 54 performs the convolution calculation of the small bottom matrices d and the weight matrices g with the Winograd algorithm, following the procedure of FIGS. 23 to 25.

This completes the basic steps of the information processing method according to the present embodiment.

According to the present embodiment described above, the calculation unit 42 calculates the combination of t and q that minimizes the first function f(t, q), which represents the calculation time of the convolution calculation, under the constraint of equation (14) that the small bottom matrices d and the weight matrices g can be stored in one bank.

The small bottom matrices d and the weight matrices g can therefore be stored in the banks of the register while the convolution calculation using these matrices is performed at high speed.

<Backward processing>
In the example of FIG. 22, the convolution calculation in the forward processing of deep learning was performed with the Winograd algorithm.

The following describes the Winograd algorithm in the backward processing of deep learning. The backward processing includes processing that convolves the top matrix with the weight matrix to obtain the bottom matrix, and processing that convolves the top matrix with the bottom matrix to obtain the weight matrix.

First, the former processing, in which the bottom matrix is obtained by the convolution calculation of the top matrix and the weight matrix, is described.

FIGS. 29(a) to 29(c) are schematic diagrams of the convolution calculation of the top matrix and the weight matrix performed with the Winograd algorithm in the backward processing.

First, as shown in FIG. 29(a), the selection unit 52 (see FIG. 21) selects a t×t small top matrix y from the top matrix of H rows and W columns.

Next, in accordance with the following equation (22), the calculation unit 54 obtains the small bottom matrix d by convolving the weight matrix g with the small top matrix y.

Next, as shown in FIG. 29(b), the position at which the small top matrix y is cut out of the top matrix is shifted by two columns from that of FIG. 29(a), and the same calculation is performed on the cut-out small top matrix y. The resulting small bottom matrix d forms the block that is adjacent, in the bottom matrix, to the small bottom matrix d obtained in FIG. 29(a).

By shifting the cut-out position of the small top matrix y in steps of two in the column direction and the row direction in this way, the bottom matrix formed from the small bottom matrices d is obtained as shown in FIG. 29(c).

This completes the convolution calculation of the top matrix and the weight matrix in the backward processing. In this example, the weight matrix g is an example of the first matrix, and the t×t small top matrix y is an example of the second matrix.

Next, the function of the storage unit 53 in this backward processing is described in detail.

The storage unit 53 arranges the elements of each array as in the following equation (23) and stores the elements in banks R#0 to R#7 of DPE0 to DPE7.

Here, N is the batch number, and (number of N) = (number of N_major) × (number of N_minor) and (number of Cout) = (number of Cout_major) × (number of Cout_minor). As in equation (5), the batch number N is specified by the pair (N_major, N_minor). In this backward processing, the batch number N is an example of a second identifier that identifies the small top matrix y.

The output channel number Cout is likewise specified by the pair (Cout_major, Cout_minor). For example, the array element with Cout_major = 0 and Cout_minor = 0 corresponds to Cout = 0, and the array element with Cout_major = 0 and Cout_minor = 1 corresponds to Cout = 1. In this backward processing, the output channel number Cout is a first identifier that identifies the small top matrix y.

In this example, as in FIG. 2, the total number of batch numbers N is 64 and the total number of output channel numbers Cout is 384. The total number of N_major values is 16, the total number of N_minor values is 4, and the total number of Cout_minor values is 4.

The elements [H''][W''] of the array y correspond to the elements of the t×t small top matrix y.

FIG. 30 shows the contents of register G#0 of each of DPE0 to DPE7 after the storage unit 53 has stored the arrays y and g.

Of the arrays y and g, the array y is stored by the storage unit 53 in banks R#0 to R#7 of DPE0 to DPE7 by the sequential method.

In the present embodiment, Cout_minor is written in the lowest position of the array y and N_minor in the position above it, as in equation (23), so that each bank corresponds one-to-one with Cout_minor within a range in which N_minor is the same. Therefore, if the total number of Cout_minor values is q (= 4), the q banks of one DPE store q small top matrices y that have mutually different output channel numbers (Cout_major, Cout_minor) and the same batch number (N_major, N_minor).

For example, in DPE0, the four banks R#0 to R#3 store four small top matrices y whose batch number N is (0, 0) and whose output channel numbers Cout are (0, 0), (0, 1), (0, 2), and (0, 3).

Unlike the example of FIG. 13, in which the batch number N differs for each of banks R#0 to R#7, this allows q calculation cores to execute in parallel the convolution calculation of q small top matrices y that have the same batch number N.

The weight matrices g, on the other hand, are transferred by the storage unit 53 from the main memory 11 to each of DPE0 to DPE7 by the multicast method, as in the example of FIG. 22.

As described with reference to FIG. 15, the multicast method gives no regularity to the values of the input channel number Cin and the output channel number Cout. In this example as well, therefore, the calculation unit 54 aligns the array g in the same way as in FIGS. 23 to 25.

Next, the calculation time of the convolution calculation in this backward processing is examined.

The calculation time required for one DPE to obtain B^TyB of equation (22) can be written as the following equation (24) by replacing Cin' in equation (6) with Cout'.

The calculation time required for one DPE to obtain GgG^T of equation (22) can be written as the following equation (25), for the same reason as equation (7).

Furthermore, the number of multiplications performed when the matrices B^TyB and GgG^T of equation (22) are multiplied element by element is expressed by the following equation (26), in the same way as equation (8).
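The images of equations (24) to (26) are not reproduced in this text. Applying the substitution described above to the forms inferred for equations (6) to (8) suggests the following. These are reconstructions from the description, and equation (26) in particular assumes that equation (8) carries over unchanged.

```latex
\begin{align}
&(t + t)\, b(t)\, \frac{C_{out}'\, N'}{q} \tag{24} \\
&(t + 3)\, w(t)\, \frac{C_{in}'\, C_{out}'}{p} \tag{25} \\
&t^2 \cdot \frac{N'\, C_{in}'\, C_{out}'}{p} \tag{26}
\end{align}
```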

そして、全てのtop行列とweight行列との畳み込み計算を行うには、式（９）のpをCout’に置き換えた次の式（２７）の回数だけ計算を行う必要がある。 Then, in order to perform the convolution calculation of all the top matrix and the weight matrix, it is necessary to perform the calculation as many times as the next equation (27) in which p in the equation (9) is replaced with Cout'.

一つのDPEで畳み込み計算をするときの計算時間を表す第１の関数f(t,q)は、式（２４）〜（２６）の和に式（２７）を乗じることにより以下の式（２８）のように表すことができる。 The first function f (t, q), which expresses the calculation time when performing convolution calculation with one DPE, is the following formula (28) by multiplying the sum of formulas (24) to (26) by formula (27). ) Can be expressed as.

次に、小top行列yとweight行列gの各々の要素数がレジスタに格納可能な個数を超えない条件について検討する。 Next, consider the condition under which the number of elements of the small top matrices y and the weight matrices g does not exceed the number that can be stored in the register.
まず、小top行列yの要素数について説明する。 First, the number of elements of the small top matrix y will be described.

一つのDPEの一つのバンクにおける小top行列yの要素数E_yは、式（１１）のCin’をCout’に置き換えることにより次の式（２９）で表すことができる。 The number of elements E_y of the small top matrices y in one bank of one DPE can be expressed by the following equation (29), obtained by replacing Cin' in equation (11) with Cout'.

一方、一つのDPEの一つのバンクにおけるweight行列gの要素数E_wは、式（１２）と同様に次の式（３０）で表すことができる。 On the other hand, the number of elements E_w of the weight matrices g in one bank of one DPE can be expressed by the following equation (30), as in equation (12).

式（２９）と式（３０）より、小top行列yとweight行列gとを合わせた要素の総数を表す第２の関数g(t,q)は次の式（３１）のように書ける。 From equations (29) and (30), the second function g (t, q) representing the total number of elements including the small top matrix y and the weight matrix g can be written as the following equation (31).

よって、一つのバンクに格納できるデータの個数をRとすると、次の式（３２）の制約条件が得られる。 Therefore, assuming that the number of data that can be stored in one bank is R, the constraint condition of the following equation (32) can be obtained.

以上により、式（３２）の制約条件を満たすt,qの組み合わせのうちで、式（２８）の第１の関数f(t,q)の値を最小とするようなtとqの組み合わせを見つけることにより、畳み込み計算を高速化できることになる。 From the above, the convolution calculation can be sped up by finding, among the combinations of t and q that satisfy the constraint of equation (32), the combination that minimizes the value of the first function f(t, q) of equation (28).

そこで、この例のようにtop行列とweight行列とを畳み込んで小bottom行列dを得るバックワード処理をする場合には、算出部４２は、式（３２）の制約条件を満たすt,qの組み合わせを特定する。そして、特定した組み合わせのうち、式（２８）の第１の関数f(t,q)の値を最小とするようなtとqの組み合わせを算出部４２が算出し、畳み込み計算を高速化する。 Therefore, when performing backward processing that convolves the top matrix and the weight matrix to obtain the small bottom matrix d as in this example, the calculation unit 42 identifies the combinations of t and q that satisfy the constraint of equation (32). The calculation unit 42 then calculates, among the identified combinations, the combination of t and q that minimizes the value of the first function f(t, q) of equation (28), thereby speeding up the convolution calculation.
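The selection described above amounts to a small constrained search. The following is a minimal Python sketch of that idea, assuming the cost model f and the footprint model g are supplied as callables. The name `best_t_q`, the candidate lists, and the toy models `toy_f` and `toy_g` are hypothetical placeholders; they are not equations (28) and (31).

```python
from itertools import product


def best_t_q(f, g, t_candidates, q_candidates, R):
    """Return the (t, q) minimizing f(t, q) among pairs with g(t, q) <= R, or None."""
    feasible = [
        (t, q)
        for t, q in product(t_candidates, q_candidates)
        if g(t, q) <= R  # constraint: the elements held in one bank must fit in R
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda tq: f(tq[0], tq[1]))


# Hypothetical stand-ins for the real cost and footprint formulas.
def toy_f(t, q):
    return t + q


def toy_g(t, q):
    return t * q


choice = best_t_q(toy_f, toy_g, t_candidates=[3, 4], q_candidates=[1, 2], R=6)
```

An exhaustive scan is used here because the candidate sets for t and q are small; the sketch returns `None` when no pair satisfies the constraint.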

次に、top行列とbottom行列とを畳み込んでweight行列を得るバックワード処理について説明する。 Next, the backward processing for convolving the top matrix and the bottom matrix to obtain the weight matrix will be described.

図３１〜図３２は、backword処理において、top行列とbottom行列との畳み込み計算をWinogradアルゴリズムで行うときの模式図である。 FIGS. 31 and 32 are schematic diagrams of the convolution calculation of the top matrix and the bottom matrix performed with the Winograd algorithm in backward processing.

まず、図３１（ａ）に示すように、選択部５２が、H×Wのtop行列からt’×t’の小top行列yを選択する。 First, as shown in FIG. 31A, the selection unit 52 selects a small top matrix y of t ′ × t ′ from the top matrix of H × W.

そして、図３１（ｂ）に示すように、選択部５２が、H’×W’のbottom行列から(t’-2)×(t’-2)の小bottom行列dを選択する。 Then, as shown in FIG. 31 (b), the selection unit 52 selects the small bottom matrix d of (t'-2) x (t'-2) from the bottom matrix of H'x W'.

続いて、図３２（ａ）に示すように、計算部５４が、小top行列yから(t’-2)×(t’-2)の行列y’を選択する。そして、計算部５４が、次の式（３３）に従ってweight行列gの11成分を求める。 Subsequently, as shown in FIG. 32(a), the calculation unit 54 selects a (t'-2) × (t'-2) matrix y' from the small top matrix y. The calculation unit 54 then obtains the (1,1) component of the weight matrix g according to the following equation (33).

次に、図３２（ｂ）に示すように、小top行列yから行列y’を選択する位置を図３２（ａ）の場合よりも１列ずらし、選択した行列y’に対して計算部５４が上記と同じ計算を行うことにより、weight行列gの12成分を求める。 Next, as shown in FIG. 32(b), the position at which the matrix y' is selected from the small top matrix y is shifted by one column from the case of FIG. 32(a), and the calculation unit 54 performs the same calculation as above on the selected matrix y' to obtain the (1,2) component of the weight matrix g.

このように小top行列yから行列y’を切り出す位置を列方向と行方向にずらすことにより、図３２（ｃ）に示すように、３×３のweight行列gの各要素を得ることができる。 By shifting the position at which the matrix y' is cut out of the small top matrix y in the column direction and the row direction in this way, each element of the 3 × 3 weight matrix g can be obtained, as shown in FIG. 32(c).

以上により、backword処理におけるtop行列とbottom行列との畳み込み計算を終える。この例では、(t’-2)×(t’-2)の小bottom行列dが第１の行列の一例となり、t’×t’の小top行列yが第２の行列の一例となる。 This completes the convolution calculation of the top matrix and the bottom matrix in backward processing. In this example, the (t'-2) × (t'-2) small bottom matrix d is an example of the first matrix, and the t' × t' small top matrix y is an example of the second matrix.
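The window-shifting procedure above can be sketched in Python as follows. Equation (33) is not reproduced here, so this sketch assumes that each component of g equals the sum of the element-wise products of the window y' and the small bottom matrix d, computed directly rather than in the Winograd form. The function name `weight_gradient_3x3` is hypothetical.

```python
def weight_gradient_3x3(y, d):
    """y: t' x t' small top matrix; d: (t'-2) x (t'-2) small bottom matrix.

    Returns a 3 x 3 list g where g[r][c] is the component obtained with the
    window y' shifted by r rows and c columns (g[0][0] is the (1,1) component).
    """
    n = len(d)  # n = t' - 2
    assert len(y) == n + 2, "y must be two rows/columns larger than d"
    g = [[0.0] * 3 for _ in range(3)]
    for r in range(3):        # row offset of the window y'
        for c in range(3):    # column offset of the window y'
            g[r][c] = sum(
                y[r + i][c + j] * d[i][j]
                for i in range(n)
                for j in range(n)
            )
    return g


y = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
d = [[1, 0], [0, 0]]  # picks the top-left element of each window
g = weight_gradient_3x3(y, d)
```

Because a (t'-2)-wide window fits into a t'-wide matrix at exactly three offsets per axis, the result is always 3 × 3, matching FIG. 32(c).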

次に、このbackword処理をする場合の格納部５３の機能について詳細に説明する。 Next, the function of the storage unit 53 in this backward processing will be described in detail.

格納部５３は、次の式（３４）のように各配列の要素を並べ、各要素をDPE0〜DPE7の各バンクR#0〜R#7に格納する。 The storage unit 53 arranges the elements of each array as in the following equation (34), and stores the elements in the banks R#0 to R#7 of DPE0 to DPE7.

この例でも、バッチ数N（=（N_major、N_minor））と入力チャネル番号Cin（=（Cin_major、Cin_minor））との組み合わせにより小bottom行列dが特定される。なお、バッチ数N（=（N_major、N_minor））は第１の識別子の一例であり、入力チャネル番号Cin（=（Cin_major、Cin_minor））は第２の識別子の一例である。 In this example as well, the small bottom matrix d is identified by the combination of the batch number N (= (N_major, N_minor)) and the input channel number Cin (= (Cin_major, Cin_minor)). The batch number N (= (N_major, N_minor)) is an example of the first identifier, and the input channel number Cin (= (Cin_major, Cin_minor)) is an example of the second identifier.

図３３は、格納部５３によって各配列y、dが格納されたDPE0〜DPE7の各々のレジスタG#0の中身を示す図である。 FIG. 33 is a diagram showing the contents of each register G # 0 of DPE0 to DPE7 in which the arrays y and d are stored by the storage unit 53.

配列dについては、格納部５３がシーケンシャル方式でDPE0〜DPE7の各バンクR#0〜R#7に格納する。 The storage unit 53 stores the array d in each bank R # 0 to R # 7 of DPE0 to DPE7 in a sequential manner.

このとき、本実施形態では、式（３４）のように配列dの最下位にN_minorを記述し、その上位にCin_minorを記述したため、Cin_minorが同一の範囲で各バンクとN_minorとが一対一に対応する。そのため、N_minorの総数をq(=4)個とすると、一つのDPEにおけるq個のバンクの各々には、バッチ数（N_major、N_minor）が相互に異なり、かつ入力チャネル番号（Cin_major、Cin_minor）が同一のq個の小bottom行列dが格納されることになる。 At this time, in the present embodiment, N_minor is placed in the lowest-order position of the array d and Cin_minor in the position just above it, as in equation (34), so that within a range where Cin_minor is the same, the banks correspond one-to-one to the values of N_minor. Therefore, if the total number of N_minor values is q (= 4), the q banks of one DPE store q small bottom matrices d whose batch numbers (N_major, N_minor) differ from one another and whose input channel number (Cin_major, Cin_minor) is the same.

例えば、DPE0においては、R#0〜R#3の４個のバンクの各々に、入力チャネル番号Cinが（0、0）であり、かつバッチ数Nが（0、0）、（0、1）、（0、2）、（0、3）である４個の小bottom行列dが格納される。 For example, in DPE0, the four banks R#0 to R#3 store four small bottom matrices d whose input channel number Cin is (0, 0) and whose batch numbers N are (0, 0), (0, 1), (0, 2), and (0, 3), respectively.

これにより、図１３のようにバンクR#0〜R#7ごとにバッチ数Nを変える例とは異なり、同一の入力チャネル番号Cinを有するq個の小bottom行列dの畳み込み計算をq個の計算コアで並列して実行することができる。 As a result, unlike the example of FIG. 13 in which the batch number N is varied for each of the banks R#0 to R#7, the convolution calculations of q small bottom matrices d having the same input channel number Cin can be executed in parallel by q calculation cores.

また、小top行列yについては、格納部５３がマルチキャスト方式によりメインメモリ１１から各DPE0〜DPE7に転送する。 Further, for the small top matrix y, the storage unit 53 transfers the small top matrix y from the main memory 11 to each DPE0 to DPE7 by a multicast method.

なお、図３０の例とは異なり、この例では式（３４）のように配列yの最下位にCout_minorを記述し、その上位にN_minorを記述する。また、Cout_minorの総数は４個とし、N_minorの総数は４個とする。 Unlike the example of FIG. 30, in this example, Cout _minor is described at the bottom of the array y and N _minor is described at the top of the array y as shown in equation (34). The total number of Cout _minors is 4, and the total number of N _minors is 4.

これにより、例えばDPE0においては、N_major=0かつN_minor=0の配列yの要素のうち、Cout_minorの値が小さい要素から順にバンクR#0〜R#3に格納される。そして、バンクR#4〜R#7には、N_major=0かつN_minor=1の要素が、Cout_minorの値が小さい順に格納される。 As a result, for example, in DPE0, among the elements of the array y of N _major = 0 and N _minor = 0, the elements having the smallest value of Cout _minor are stored in banks R # 0 to R # 3 in order. Then, in banks R # 4 to R # 7, elements of N _major = 0 and N _minor = 1 are stored in ascending order of the value of Cout _minor .

また、配列yのN_major=1の要素についても、Cout_minorの値が小さい要素から順にバンクR#0〜R#3に格納され、バンクR#0〜R#3にはN_minorの値が一つ繰り上がった要素が格納されていく。 Also, the elements of N _major = 1 in the array y are stored in banks R # 0 to R # 3 in order from the element with the smallest value of Cout _minor , and the values of N _minor are stored in banks R # 0 to R # 3. The element that is moved up by one is stored.

これにより、一つのバンクにはCout_minor値が同一の配列yの要素が格納されるようになるため、バンク内でCout_minor値を揃えるために配列yの各要素を整列させる必要がない。 As a result, the elements of the array y having the same Cout _minor value are stored in one bank, so that it is not necessary to align each element of the array y in order to align the Cout _minor values in the bank.

式（３３）のGy’G^Tを一つのDPEで求めるのに要する計算時間は、式（２４）におけるtをt’に置き換えることにより、次の式（３５）のように書ける。 The calculation time required to obtain G y' G^T of equation (33) with one DPE can be written as the following equation (35) by replacing t in equation (24) with t'.

また、式（３３）のB^TdBを一つのDPEで求めるのに要する計算時間は、式（２５）の3をt’-2に置き換え、tをt’に置き換え、cout’をN’に置き換えることにより、次の式（３６）のように書ける。 The calculation time required to obtain B^T d B of equation (33) with one DPE can be written as the following equation (36) by replacing 3 in equation (25) with t'-2, t with t', and Cout' with N'.

更に、式（３３）において行列Gy’G^Tと行列B^TdBとのそれぞれの要素ごとを乗算するときの乗算の回数は、式（８）と同様に次の式（３７）で表される。 Further, the number of multiplications required in equation (33) to multiply the matrix G y' G^T and the matrix B^T d B element by element is given by the following equation (37), as in equation (8).

そして、全てのtop行列とweight行列との畳み込み計算を行うには、式（２７）と同様に次の式（３８）の回数だけ計算を行う必要がある。 Then, in order to perform the convolution calculation of all the top matrix and the weight matrix, it is necessary to perform the calculation as many times as the following equation (38) as in the equation (27).

一つのDPEで畳み込み計算をするときの計算時間を表す第１の関数f(t,q)は、式（３５）〜（３７）の和に式（３８）を乗じることにより以下の式（３９）のように表すことができる。 The first function f(t, q), which represents the calculation time when one DPE performs the convolution calculation, is obtained by multiplying the sum of equations (35) to (37) by equation (38), and can be expressed as the following equation (39).

次に、小bottom行列dと小top行列yの各々の要素数がレジスタに格納可能な個数を超えない条件について検討する。 Next, consider the condition that the number of elements of each of the small bottom matrix d and the small top matrix y does not exceed the number that can be stored in the register.

まず、小top行列yの要素数について説明する。一つのDPEの一つのバンクにおける小top行列yの要素数E_yは、次の式（４０）のように書ける。 First, the number of elements of the small top matrix y will be described. The number of elements E _y of the small top matrix y in one bank of one DPE can be written as the following equation (40).

式（４０）において、t²は、一つの小top行列yの要素数である。また、N’・Cin’/pは、一つのバンクに格納される小top行列yの個数である。 In equation (40), t² is the number of elements in one small top matrix y, and N'・Cin'/p is the number of small top matrices y stored in one bank.

一方、一つのDPEの一つのバンクにおける小bottom行列dの要素数E_dは、次の式（４１）のように書ける。 On the other hand, the number of elements E_d of the small bottom matrices d in one bank of one DPE can be written as the following equation (41).

式（４１）において、(t’-2)²は、一つの小bottom行列dの要素数である。また、N’・Cout’/pは、一つのバンクに格納される小bottom行列dの個数である。 In equation (41), (t'-2)² is the number of elements in one small bottom matrix d, and N'・Cout'/p is the number of small bottom matrices d stored in one bank.

式（４０）と式（４１）より、小top行列yと小bottom行列dとを合わせた要素の総数を表す第２の関数g(t,q)は次の式（４２）のように書ける。 From equations (40) and (41), the second function g(t, q), which represents the total number of elements of the small top matrices y and the small bottom matrices d combined, can be written as the following equation (42).

よって、一つのバンクに格納できるデータの個数をRとすると、次の式（４３）の制約条件が得られる。 Therefore, assuming that the number of data that can be stored in one bank is R, the constraint condition of the following equation (43) can be obtained.

以上により、式（４３）の制約条件を満たすt,qの組み合わせのうちで、式（３９）の第１の関数f(t,q)の値を最小とするようなtとqの組み合わせを見つけることにより、畳み込み計算を高速化できることになる。 From the above, the convolution calculation can be sped up by finding, among the combinations of t and q that satisfy the constraint of equation (43), the combination that minimizes the value of the first function f(t, q) of equation (39).

そこで、この例のようにbottom行列とtop行列とを畳み込んでweight行列を得るバックワード処理をする場合には、算出部４２は、式（４３）の制約条件を満たすt,qの組み合わせを特定する。そして、特定した組み合わせのうち、式（３９）の第１の関数f(t,q)の値を最小とするようなtとqの組み合わせを算出部４２が算出し、畳み込み計算を高速化する。 Therefore, when performing backward processing that convolves the bottom matrix and the top matrix to obtain the weight matrix as in this example, the calculation unit 42 identifies the combinations of t and q that satisfy the constraint of equation (43). The calculation unit 42 then calculates, among the identified combinations, the combination of t and q that minimizes the value of the first function f(t, q) of equation (39), thereby speeding up the convolution calculation.

＜１×１の畳み込み＞ <1×1 convolution>
深層学習においては１×１の畳み込みが行われることがある。例えば、ResNet-50やResNet101においては１×１の畳み込みが使用される。そこで、本実施形態における１×１の畳み込みについて説明する。 In deep learning, a 1×1 convolution is sometimes performed. For example, ResNet-50 and ResNet101 use 1×1 convolutions. The 1×1 convolution in the present embodiment is therefore described below.

なお、１×１の畳み込みの対象となる行列は特に限定されないが、以下では小bottom行列dとweight行列gとの畳み込みについて説明する。 The matrix to be convolved by 1 × 1 is not particularly limited, but the convolution of the small bottom matrix d and the weight matrix g will be described below.

行列d、gの１×１の畳み込みを行う場合は、格納部５３は、次の式（４４）のように各行列の要素を配列に格納し、各要素をDPE0〜DPE7の各バンクR#0〜R#7に格納する。 When performing a 1×1 convolution of the matrices d and g, the storage unit 53 stores the elements of each matrix in an array as in the following equation (44), and stores the elements in the banks R#0 to R#7 of DPE0 to DPE7.

式（４４）における各配列d、gの要素の並び順は式（５）におけるのと同様である。例えば、配列dにおいては最下位にCin_minorが記述され、その上位にN_minorが記述される。 The order of the elements of the arrays d and g in the equation (44) is the same as that in the equation (5). For example, in the array d, Cin _minor is described at the bottom, and N _minor is described above it.

図３４は、１×１の畳み込みを行う場合に、格納部５３によって各配列d、gが格納されたDPE0のレジスタG#0の中身を示す図である。 FIG. 34 is a diagram showing the contents of the register G # 0 of DPE0 in which the arrays d and g are stored by the storage unit 53 when 1 × 1 convolution is performed.

式（５）の場合には図２２のようにシーケンシャル方式によりDPE0〜DPE7に配列dを格納したが、この例ではマルチキャスト方式によりDPE0〜DPE7に配列dを格納する。 In the case of equation (5), the array d is stored in DPE0 to DPE7 by the sequential method as shown in FIG. 22, but in this example, the array d is stored in DPE0 to DPE7 by the multicast method.

これにより、例えばN_major=0かつN_minor=0の要素は、Cin_minor=0,1,2,3の順にバンクR#0,R#1,R#2,R#3に格納されていく。そして、N_major=0かつN_minor=0の全ての要素が格納されると、次はN_major=0かつN_minor=1の要素がCin_minor=0,1,2,3の順にバンクR#4,R#5,R#6,R#7に格納されていく。これにより各バンクR#0〜R#7の最初の一つのラインが埋まるため、N_minor=2以降の要素は一つ上のラインに格納される。 As a result, for example, the elements with N_major = 0 and N_minor = 0 are stored in banks R#0, R#1, R#2, and R#3 in the order Cin_minor = 0, 1, 2, 3. Once all the elements with N_major = 0 and N_minor = 0 have been stored, the elements with N_major = 0 and N_minor = 1 are stored in banks R#4, R#5, R#6, and R#7 in the order Cin_minor = 0, 1, 2, 3. This fills the first line of each of the banks R#0 to R#7, so the elements with N_minor = 2 and later are stored in the next line up.
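The placement rule just described can be written as a small index computation. The Python sketch below assumes eight banks and four Cin_minor values, as in the example; the function name `bank_and_line` and the zero-based line index are hypothetical conventions introduced only for illustration.

```python
BANKS = 8        # banks R#0..R#7 of one DPE
CIN_MINOR = 4    # number of distinct Cin_minor values in the example


def bank_and_line(n_minor, cin_minor):
    """Map (N_minor, Cin_minor) to (bank index, line index) within one N_major."""
    per_line = BANKS // CIN_MINOR                 # N_minor values that fit on one line (2)
    bank = (n_minor % per_line) * CIN_MINOR + cin_minor
    line = n_minor // per_line                    # every two N_minor values start a new line
    return bank, line


# Elements that end up in bank R#0 for N_minor = 0..15.
in_bank0 = [
    (n, c)
    for n in range(16)
    for c in range(CIN_MINOR)
    if bank_and_line(n, c)[0] == 0
]
```

Under these assumptions, bank R#0 receives only elements with Cin_minor = 0 and even N_minor, which is the property the per-bank calculations rely on.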

なお、N_major=1の配列dの要素については、N_major=0の要素の畳み込みが終了した後にDPE0に展開される。N_majorの値が2以上の配列dの要素についても同様である。 Note that the elements of the array d with N _major = 1 are expanded to DPE0 after the convolution of the elements with N _major = 0 is completed. The same applies to the elements of the array d in which the value of N _major is 2 or more.

また、配列gについても、マルチキャスト方式によりバンクR#0に配列gを格納する。 The array g is likewise stored in the bank R#0 by the multicast method.

１×１の畳み込みに適用可能なWinogradアルゴリズムは存在しない。よって、この例では、各バンクR#0〜R#7に格納された要素を用いて、計算部５４が図３（ａ）〜（ｃ）に示した手順で畳み込みを行う。 There is no Winograd algorithm applicable to 1x1 convolution. Therefore, in this example, the calculation unit 54 performs convolution according to the procedure shown in FIGS. 3 (a) to 3 (c) using the elements stored in each of the banks R # 0 to R # 7.
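For reference, a direct 1×1 convolution reduces to a weighted sum over input channels at every pixel. The following Python sketch shows that general definition for one batch element; it is not taken from FIG. 3, and the function name `conv1x1` and the index order `d[ci][h][w]`, `g[co][ci]` are assumptions made for illustration.

```python
def conv1x1(d, g):
    """Direct 1x1 convolution for one batch element.

    d[ci][h][w]: input values per input channel ci.
    g[co][ci]:   1x1 weights per (output channel, input channel) pair.
    Returns out[co][h][w] = sum over ci of g[co][ci] * d[ci][h][w].
    """
    cin, height, width = len(d), len(d[0]), len(d[0][0])
    return [
        [
            [sum(g[co][ci] * d[ci][h][w] for ci in range(cin)) for w in range(width)]
            for h in range(height)
        ]
        for co in range(len(g))
    ]


d = [[[1, 2]], [[3, 4]]]    # two input channels, 1 x 2 pixels each
g = [[1, 1], [2, 0]]        # two output channels
out = conv1x1(d, g)
```

Since the kernel covers a single pixel, there is no spatial overlap between neighboring outputs for a fast algorithm to exploit, which is consistent with the text falling back to the direct procedure.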

＜batch normalization＞ <Batch normalization>
深層学習においては、batch normalizationを行うことにより性能が向上する場合がある。batch normalizationは、複数の画像の間で画素データの値が大きく異なる場合に、各画像の画素データの平均値を０にし、かつその分散を１にする規格化の手法である。その手法について以下に説明する。 In deep learning, performing batch normalization may improve performance. Batch normalization is a normalization technique that, when pixel values differ greatly among a plurality of images, sets the mean of the pixel data of each image to 0 and its variance to 1. The technique is described below.

batch normalizationを行う場合は、格納部５３は、次の式（４５）のように各配列d、yの各要素を並べ、各要素をDPE0〜DPE7の各バンクR#0〜R#7にマルチキャスト方式で格納する。 When batch normalization is performed, the storage unit 53 arranges each element of each array d and y as shown in the following equation (45), and multicasts each element to each bank R # 0 to R # 7 of DPE0 to DPE7. Store by method.

batch normalizationは、bottom行列とtop行列のどちらにも適用することができる。以下では、bottom行列の一部である小bottom行列dに対してbatch normalizationを行う場合について説明する。 batch normalization can be applied to both bottom and top matrices. The case where batch normalization is performed on the small bottom matrix d, which is a part of the bottom matrix, will be described below.

図３５は、batch normalizationのときに格納部５３によって小bottom行列dが格納されたDPE0のレジスタG#0の中身を示す図である。 FIG. 35 is a diagram showing the contents of the register G # 0 of DPE0 in which the small bottom matrix d is stored by the storage unit 53 at the time of batch normalization.

この例では、図３４におけるのと同様に、格納部５３が、マルチキャスト方式によりバンクR#0に小bottom行列dを格納する。式（４５）に示すように、小bottom行列dの最下位にはCin_minorが記述される。よって、各バンクR#0〜R#7のうちの一つに着目すると、そのバンクにはCin_minorの値が同じ要素が格納される。例えば、バンクR#0には、Cin_minor=0の要素のみが格納される。 In this example, as in FIG. 34, the storage unit 53 stores the small bottom matrix d in the bank R # 0 by the multicast method. As shown in equation (45), Cin _minor is described at the bottom of the small bottom matrix d. Therefore, focusing on one of the banks R # 0 to R # 7, elements with the same Cin _minor value are stored in that bank. For example, bank R # 0 stores only elements with Cin _minor = 0.

また、式（４５）によれば、小bottom行列dにおいてCin_minorの上位にN_minorが記述される。そのため、各バンクR#0〜R#7のうちの一つに着目すると、そのバンクには、バッチ数（N_major、N_minor）が異なる要素が格納される。例えば、バンクR#0には、（N_major、N_minor）=（0、0）、（0、2）、…（0、14）、（1、0）、（1、2）、…（1、14）、…（3、0）、（3、2）、…（3、14）の要素が格納される。 Further, according to equation (45), N_minor is placed just above Cin_minor in the small bottom matrix d. Therefore, any one of the banks R#0 to R#7 stores elements with different batch numbers (N_major, N_minor). For example, bank R#0 stores the elements with (N_major, N_minor) = (0, 0), (0, 2), ..., (0, 14), (1, 0), (1, 2), ..., (1, 14), ..., (3, 0), (3, 2), ..., (3, 14).

このように、一つのバンクには、Cin_minorが同じでバッチ数（N_major、N_minor）が異なる要素が格納される。そのため、計算コアC#0〜C#7の各々が、自身に対応する一つのバンクのみを用いて、Cin_minorが同じでバッチ数（N_major、N_minor）が異なる複数の要素の平均と、これらの要素の分散とを計算することができる。 In this way, one bank stores elements with the same Cin _minor but different batch numbers (N _major , N _minor ). Therefore, each of the calculation cores C # 0 to C # 7 uses only one bank corresponding to itself, and the average of multiple elements with the same Cin _minor but different batch numbers (N _major , N _minor ), and The variance of these elements can be calculated.

その計算は、計算部５４によって以下のように実行される。 The calculation is executed by the calculation unit 54 as follows.
図３６（ａ）、（ｂ）は、batch normalizationのときに計算部５４が行う計算について説明するためのDPE0のレジスタG#0の中身を示す図である。 FIGS. 36(a) and 36(b) are diagrams showing the contents of the register G#0 of DPE0, for explaining the calculation performed by the calculation unit 54 during batch normalization.

まず、図３６（ａ）に示すように、計算コアC#0が、バンクR#0にある小bottom行列dの各要素の値を加算し、これにより得られた値x₀をバンクR#0のラインL_{sum_1}に格納する。他のバンクR#1〜R#7においても、計算コアC#1〜C#7の各々が、対応するバンクにある小bottom行列dの各要素の値を加算し、これにより得られた値x₁〜x₇をそれぞれバンクR#1〜R#7のラインL_{sum_1}に格納する。 First, as shown in FIG. 36(a), the calculation core C#0 adds up the values of the elements of the small bottom matrix d in bank R#0 and stores the resulting value x₀ in the line L_{sum_1} of bank R#0. Likewise, in the other banks R#1 to R#7, each of the calculation cores C#1 to C#7 adds up the values of the elements of the small bottom matrix d in its corresponding bank and stores the resulting values x₁ to x₇ in the line L_{sum_1} of banks R#1 to R#7, respectively.

ここで、図３５に示されるように、バンクR#0にはN_minorが偶数の要素のみが格納される。そのため、値x₀は、全てのバッチ数（N_major、N_minor）にわたる要素の合計ではなく、N_minorが偶数の要素の値のみを合計したものとなる。 Here, as shown in FIG. 35, only the elements having an even N _minor are stored in the bank R # 0. Therefore, the value x ₀ is not the sum of the elements over all batch numbers (N _major , N _minor ), but the sum of only the values of the elements with even N _minor .

そこで、計算部５４は、値x₀〜x₇のうちで同一のCin_minorに対応するもの同士を加算する。例えば、値x₀と値x₄は両方ともCin_minor=0に対応するため、計算部５４は、両者を加算してその結果を値x₀に書き込む。これにより得られた値x₀は、Cin_minor=0の要素を全てのバッチ数（N_major、N_minor）にわたって合計した値となる。同様にして、計算部５４は次の計算を行う。 Therefore, the calculation unit 54 adds together those of the values x₀ to x₇ that correspond to the same Cin_minor. For example, since the values x₀ and x₄ both correspond to Cin_minor = 0, the calculation unit 54 adds them and writes the result to x₀. The resulting value x₀ is the sum of the elements with Cin_minor = 0 over all batch numbers (N_major, N_minor). Similarly, the calculation unit 54 performs the following calculations.
x₁=x₁+x₅
x₂=x₂+x₆
x₃=x₃+x₇

次に、計算コアC#0が、バンクR#0に格納した値x₀をバッチ数で割ることにより平均値m₀を計算し、その平均値m₀をバンクR#0のラインL_meanに格納する。バンクR#1〜R#3においても、計算コアC#1〜C#3の各々が値x₁〜x₃の平均値m₁〜m₃を計算し、これらの値をそれぞれバンクR#1〜R#3のラインL_meanに格納する。 Next, the calculation core C#0 calculates the mean value m₀ by dividing the value x₀ stored in bank R#0 by the number of batches, and stores m₀ in the line L_mean of bank R#0. Likewise, in banks R#1 to R#3, the calculation cores C#1 to C#3 calculate the mean values m₁ to m₃ from the values x₁ to x₃ and store them in the line L_mean of banks R#1 to R#3, respectively.
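The three steps above (per-bank sums, pairwise combination, division) can be sketched in Python as follows. The sketch models each bank as a plain list of values; the function name `per_channel_mean` and the parameter `batch_count`, which stands for the divisor the text calls the number of batches, are hypothetical.

```python
def per_channel_mean(banks, batch_count, channels=4):
    """Compute m_0..m_3 from the values held in banks R#0..R#7.

    banks[b] is the list of values in bank R#b; banks b and b + channels
    hold elements with the same Cin_minor = b.
    """
    x = [sum(values) for values in banks]      # partial sums x_0..x_7 (line L_sum_1)
    for i in range(channels):                  # x_i = x_i + x_{i+4}
        x[i] += x[i + channels]
    return [x[i] / batch_count for i in range(channels)]   # means m_0..m_3 (line L_mean)


banks = [[1, 2], [0, 0], [0, 0], [0, 0], [3, 4], [0, 0], [0, 0], [0, 0]]
means = per_channel_mean(banks, batch_count=4)
```

In this usage, `batch_count` is set to the number of values per channel (two in bank R#0 plus two in bank R#4), so that the result is the ordinary arithmetic mean.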

以上により、バンクR#0〜R#3ごとに小bottom行列dの要素の平均値m₀〜m₃が得られたことになる。 From the above, the mean values m₀ to m₃ of the elements of the small bottom matrix d are obtained for each of the banks R#0 to R#3.
次に、分散を求める計算方法について説明する。 Next, the method of calculating the variance will be described.

まず、図３６（ｂ）に示すように、計算コアC#0が、バンクR#0にある小bottom行列dの各要素の値を二乗し、これにより得られた各値を合計した値y₀をバンクR#0のラインL_{sum_2}に格納する。他のバンクR#1〜R#7においても、計算コアC#1〜C#7の各々が、対応するバンクにある各要素を二乗してそれらを加算し、これにより得られた値y₁〜y₇をそれぞれバンクR#1〜R#7のラインL_{sum_2}に格納する。 First, as shown in FIG. 36(b), the calculation core C#0 squares the value of each element of the small bottom matrix d in bank R#0 and stores the sum y₀ of the squared values in the line L_{sum_2} of bank R#0. Likewise, in the other banks R#1 to R#7, each of the calculation cores C#1 to C#7 squares the elements in its corresponding bank, adds them up, and stores the resulting values y₁ to y₇ in the line L_{sum_2} of banks R#1 to R#7, respectively.

図３６（ａ）の例と同様に、値y₀は、全てのバッチ数（N_major、N_minor）にわたる要素の二乗の合計ではなく、N_minorが偶数の要素を二乗した値のみを合計したものとなる。そこで、計算部５４は、次の計算を行うことにより、全てのバッチ数（N_major、N_minor）にわたる小bottom行列dの要素の二乗の合計を値y₀〜y₃の各々に書き込む。 As in the example of FIG. 36(a), the value y₀ is not the sum of the squares of the elements over all batch numbers (N_major, N_minor), but the sum of the squares of only the elements with even N_minor. The calculation unit 54 therefore performs the following calculations to write, to each of the values y₀ to y₃, the sum of the squares of the elements of the small bottom matrix d over all batch numbers (N_major, N_minor).
y₀=y₀+y₄
y₁=y₁+y₅
y₂=y₂+y₆
y₃=y₃+y₇

次に、計算コアC#0が、バンクR#0に格納した値y₀をバッチ数で割ることにより平均値a₀を計算し、その平均値a₀をバンクR#0のラインL_{mean_2}に格納する。バンクR#1〜R#3においても、計算コアC#1〜C#3の各々が値y₁〜y₃の平均値a₁〜a₃を計算し、これらの値をそれぞれバンクR#1〜R#3のラインL_{mean_2}に格納する。 Next, the calculation core C#0 calculates the mean value a₀ by dividing the value y₀ stored in bank R#0 by the number of batches, and stores a₀ in the line L_{mean_2} of bank R#0. Likewise, in banks R#1 to R#3, the calculation cores C#1 to C#3 calculate the mean values a₁ to a₃ from the values y₁ to y₃ and store them in the line L_{mean_2} of banks R#1 to R#3, respectively.

以上により、バンクR#0〜R#3ごとに小bottom行列dの要素の二乗の平均値a₀〜a₃が得られたことになる。 From the above, the mean values a₀ to a₃ of the squares of the elements of the small bottom matrix d are obtained for each of the banks R#0 to R#3.

次に、計算部５４は、v₀=a₀-m₀²を計算することにより、バンクR#0にある小bottom行列dの各要素の分散v₀を算出し、それをバンクR#0のラインL_varに格納する。これと同様に、計算部５４が以下の計算を行うことによりバンクR#1〜R#3の各要素の分散v₁〜v₃を算出し、それをバンクR#1〜R#3のラインL_varに格納する。 Next, the calculation unit 54 calculates the variance v₀ of the elements of the small bottom matrix d in bank R#0 by computing v₀ = a₀ - m₀², and stores it in the line L_var of bank R#0. Similarly, the calculation unit 54 calculates the variances v₁ to v₃ of the elements in banks R#1 to R#3 by performing the following calculations, and stores them in the line L_var of banks R#1 to R#3.
v₁=a₁-m₁²
v₂=a₂-m₂²
v₃=a₃-m₃²
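The variance formula above is the identity "mean of squares minus square of the mean", which lets each core accumulate a sum and a sum of squares without first knowing the mean. A minimal Python sketch for one channel follows; the function name `variance_from_moments` is hypothetical.

```python
def variance_from_moments(values):
    """Return v = a - m**2 for one channel, where m is the mean of the
    values and a is the mean of their squares (population variance)."""
    n = len(values)
    m = sum(values) / n                    # mean m_i
    a = sum(v * v for v in values) / n     # mean of squares a_i
    return a - m * m                       # v_i = a_i - m_i^2


v = variance_from_moments([1.0, 2.0, 3.0, 4.0])
```

For the sample values, m = 2.5 and a = 7.5, so the function returns 7.5 - 6.25 = 1.25.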

この後は、計算部５４は、以下の式（４６）のように小bottom行列dの各要素の値(d[N_major][Cin_major][H][W][N_minor][i])と平均値m_iとの差を分散v_iで割ることにより、Cin_minor=i (i=0,1,2,3)の要素に対してbatch normalizationを行う。 After this, the calculation unit 54 performs batch normalization on the elements with Cin_minor = i (i = 0, 1, 2, 3) by dividing the difference between the value of each element of the small bottom matrix d (d[N_major][Cin_major][H][W][N_minor][i]) and the mean value m_i by the variance v_i, as in the following equation (46).
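The normalization step can be sketched in Python as follows. The sketch follows the text literally and divides by the variance v_i; the commonly used formulation of batch normalization instead divides by the square root of the variance plus a small constant, and that variant is not shown. The function name `normalize_channel` is hypothetical.

```python
def normalize_channel(values, mean, var):
    """Apply (value - mean) / var to every element of one channel,
    as described in the text for Cin_minor = i."""
    if var == 0:
        raise ValueError("variance is zero; normalization is undefined")
    return [(v - mean) / var for v in values]


normalized = normalize_channel([1.0, 2.0, 3.0, 4.0], mean=2.5, var=1.25)
```

After this step the values of the channel are centered on zero, since the mean has been subtracted from every element.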

以上によりbatch normalizationを終える。 This completes batch normalization.

このようにbatch normalizationを行うことで、深層学習における学習性能の向上が期待できる。 By performing batch normalization in this way, improvement in learning performance in deep learning can be expected.

以上説明した各実施形態に関し、更に以下の付記を開示する。
（付記１）複数の第１の行列とt行t列の複数の第２の行列の各々の要素の総数が、レジスタが備える複数の記憶領域のうちのq個の各々に格納できるデータの個数を超えないtとqの組み合わせのうちで、q個の前記記憶領域の各々に対応したq個の計算コアの各々が複数の前記第１の行列と複数の前記第２の行列との畳み込み計算をWinogradアルゴリズムで並列して実行するときの計算時間が最小となる組み合わせを算出する算出部と、
算出したtとqの組み合わせを用いてq個の前記記憶領域の各々に複数の前記第１の行列とt行t列の複数の前記第２の行列とを格納する処理と、q個の前記計算コアの各々がWinogradアルゴリズムを用いて前記第１の行列と前記第２の行列との畳み込み計算を行う処理とを、前記計算コアと前記レジスタとを備えた計算機に実行させるためのプログラムを出力する出力部と、
を有することを特徴とする情報処理装置。
（付記２）前記第１の行列と前記第２の行列の各々は、深層学習の畳み込み層における行列であることを特徴とする付記１に記載の情報処理装置。
（付記３）前記計算時間を第１の関数f(t,q)で表し、かつ一つの前記記憶領域に格納される複数の前記第１の行列と複数の前記第２の行列の各々の前記要素の前記総数を第２の関数g(t,q)で表したときに、前記算出部は、一つの前記記憶領域に格納可能なデータの個数を前記第２の関数g(t,q)の値が超えない範囲内で前記第１の関数f(t,q)の値が最小となるqとtとの組み合わせを算出することを特徴とする付記１に記載の情報処理装置。
（付記４）前記第１の行列と前記第２の行列の各々は、深層学習の畳み込み層における行列であり、
前記深層学習のバックワード処理における前記第１の関数f(t,q)及び前記第２の関数g(t,q)は、前記深層学習のフォワード処理における前記第１の関数f(t,q)及び前記第２の関数g(t,q)とそれぞれ異なることを特徴とする付記３に記載の情報処理装置。
（付記５）複数の前記第２の行列の各々は、第１の識別子と第２の識別子との組み合わせにより特定され、
前記プログラムは、
前記第１の識別子が相互に異なり、かつ前記第２の識別子が同一のq個の前記第２の行列の各々を、q個の前記記憶領域の各々に格納する処理を前記計算機に実行させることを特徴とする付記１に記載の情報処理装置。
（付記６）前記プログラムは、
前記第１の識別子が相互に等しい前記第１の行列と前記第２の行列とを同一の前記記憶領域に格納し、
同一の前記記憶領域に格納された前記第１の行列と前記第２の行列との間で前記畳み込み計算を実行する処理を前記計算機に実行させることを特徴とする付記５に記載の情報処理装置。
（付記７）前記プログラムは、
複数の前記記憶領域ごとに前記要素の値の平均値と分散とを計算し、
複数の前記記憶領域ごとに、前記要素の値と前記平均値との差を前記分散で割ることにより、前記要素の値を規格化する処理を前記計算機に実行させることを特徴とする付記１に記載の情報処理装置。
（付記８）複数の第１の行列とt行t列の複数の第２の行列の各々の要素の総数が、レジスタが備える複数の記憶領域のうちのq個の各々に格納できるデータの個数を超えないtとqの組み合わせのうちで、q個の前記記憶領域の各々に対応したq個の計算コアの各々が複数の前記第１の行列と複数の前記第２の行列との畳み込み計算をWinogradアルゴリズムで並列して実行するときの計算時間が最小となる組み合わせを算出する処理と、
算出したtとqの組み合わせを用いてq個の前記記憶領域の各々に複数の前記第１の行列とt行t列の複数の前記第２の行列とを格納する処理と、q個の前記計算コアの各々がWinogradアルゴリズムを用いて前記第１の行列と前記第２の行列との畳み込み計算を行う処理とを、前記計算コアと前記レジスタとを備えた計算機に実行させるためのプログラムを出力する処理と、
をコンピュータに実行させるための情報処理プログラム。
（付記９）複数の第１の行列とt行t列の複数の第２の行列の各々の要素の総数が、レジスタが備える複数の記憶領域のうちのq個の各々に格納できるデータの個数を超えないtとqの組み合わせのうちで、q個の前記記憶領域の各々に対応したq個の計算コアの各々が複数の前記第１の行列と複数の前記第２の行列との畳み込み計算をWinogradアルゴリズムで並列して実行するときの計算時間が最小となる組み合わせを算出する処理と、
算出したtとqの組み合わせを用いてq個の前記記憶領域の各々に複数の前記第１の行列とt行t列の複数の前記第２の行列とを格納する処理と、q個の前記計算コアの各々がWinogradアルゴリズムを用いて前記第１の行列と前記第２の行列との畳み込み計算を行う処理とを、前記計算コアと前記レジスタとを備えた計算機に実行させるためのプログラムを出力する処理と、
をコンピュータが実行することを特徴とする情報処理方法。
The following appendices are further disclosed with respect to the embodiments described above.
(Appendix 1) An information processing device comprising: a calculation unit that calculates, from among combinations of t and q for which the total number of elements of a plurality of first matrices and a plurality of t-row, t-column second matrices does not exceed the number of data items that can be stored in each of q storage areas among a plurality of storage areas of a register, the combination that minimizes the calculation time when each of q calculation cores corresponding to the respective q storage areas executes, in parallel and with the Winograd algorithm, the convolution calculation of the plurality of first matrices and the plurality of second matrices; and
an output unit that outputs a program for causing a computer including the calculation cores and the register to execute, using the calculated combination of t and q, a process of storing the plurality of first matrices and the plurality of t-row, t-column second matrices in each of the q storage areas, and a process in which each of the q calculation cores performs the convolution calculation of the first matrices and the second matrices using the Winograd algorithm.
(Appendix 2) The information processing device according to Appendix 1, wherein each of the first matrices and the second matrices is a matrix in a convolution layer of deep learning.
(Appendix 3) The information processing device according to Appendix 1, wherein, when the calculation time is represented by a first function f(t, q) and the total number of elements of the plurality of first matrices and the plurality of second matrices stored in one of the storage areas is represented by a second function g(t, q), the calculation unit calculates the combination of q and t that minimizes the value of the first function f(t, q) within a range in which the value of the second function g(t, q) does not exceed the number of data items that can be stored in one of the storage areas.
(Appendix 4) The information processing device according to Appendix 3, wherein each of the first matrices and the second matrices is a matrix in a convolution layer of deep learning, and
the first function f(t, q) and the second function g(t, q) in backward processing of the deep learning differ from the first function f(t, q) and the second function g(t, q), respectively, in forward processing of the deep learning.
(Appendix 5) Each of the plurality of the second matrices is specified by a combination of the first identifier and the second identifier.
The program
To have the computer execute a process of storing each of the q second matrices having the same second identifier in each of the q storage areas, which are different from each other in the first identifier. The information processing apparatus according to Appendix 1.
(Appendix 6) The information processing device according to Appendix 5, wherein the program causes the computer to execute
a process of storing the first matrix and the second matrix that have the same first identifier in the same storage area, and
a process of executing the convolution calculation between the first matrix and the second matrix stored in the same storage area.
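The placement described in Appendices 5 and 6 can be illustrated with a small sketch. The rule used here, placing a matrix in the storage area whose index is its first identifier modulo q, is an assumption made for illustration; the text only requires that q second matrices sharing a second identifier land in q different storage areas and that matrices sharing a first identifier share a storage area. The names `place_in_storage_areas` and `pairs_to_convolve` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

SecondKey = Tuple[int, int]  # (first identifier, second identifier)


def place_in_storage_areas(
    second_keys: Iterable[SecondKey],
    first_ids: Iterable[int],
    q: int,
) -> List[Dict[str, list]]:
    """Distribute matrix identifiers over q storage areas.

    Assumed rule (not from the source): area index = first identifier % q.
    """
    areas: List[Dict[str, list]] = [{"first": [], "second": []} for _ in range(q)]
    for first_id in first_ids:
        areas[first_id % q]["first"].append(first_id)
    for first_id, second_id in second_keys:
        areas[first_id % q]["second"].append((first_id, second_id))
    return areas


def pairs_to_convolve(area: Dict[str, list]) -> List[Tuple[int, SecondKey]]:
    """List (first matrix, second matrix) pairs in one area that share a first identifier."""
    return [(f, s) for f in area["first"] for s in area["second"] if s[0] == f]


if __name__ == "__main__":
    q = 4
    # Two groups of q second matrices; each group shares one second identifier.
    keys = [(c, n) for n in range(2) for c in range(q)]
    areas = place_in_storage_areas(keys, range(q), q)
    for index, area in enumerate(areas):
        print(index, pairs_to_convolve(area))
```

Because every pair to be convolved sits in a single storage area, each calculation core can work only on its own area without reading data held for another core.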
(Appendix 7) The information processing device according to Appendix 1, wherein the program causes the computer to execute
a process of calculating the average value and the variance of the values of the elements for each of the plurality of storage areas, and
a process of normalizing the values of the elements by dividing, for each of the plurality of storage areas, the difference between the value of each element and the average value by the variance.
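The normalization of Appendix 7 can be sketched as follows. The sketch follows the wording of the text and divides by the variance itself; commonly used batch normalization divides by the square root of the variance plus a small constant instead. The use of the population variance, the zero-variance check, and the name `normalize_per_area` are assumptions made for this illustration.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def normalize_per_area(areas: Sequence[Sequence[float]]) -> List[List[float]]:
    """Normalize element values independently for each storage area.

    For each area, computes (value - mean) / variance, as worded in the text.
    """
    normalized: List[List[float]] = []
    for values in areas:
        mean = fmean(values)
        # Population variance is an assumption; the text does not say which kind.
        variance = pvariance(values, mu=mean)
        if variance == 0:
            raise ValueError("variance is zero; normalization is undefined")
        normalized.append([(v - mean) / variance for v in values])
    return normalized


if __name__ == "__main__":
    # Two storage areas, each holding four element values.
    print(normalize_per_area([[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0]]))
```

Since the mean and the variance are computed per storage area, each calculation core can normalize its own data without exchanging values with the other cores.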
(Appendix 8) An information processing program that causes a computer to execute:
a process of calculating, among combinations of t and q for which the total number of elements of a plurality of first matrices and a plurality of second matrices of t rows and t columns does not exceed the number of data that can be stored in each of q storage areas among a plurality of storage areas of a register, the combination that minimizes the calculation time when each of q calculation cores corresponding to the respective q storage areas executes, in parallel, a convolution calculation of the plurality of first matrices and the plurality of second matrices by the Winograd algorithm; and
a process of outputting a program for causing a computer having the calculation cores and the register to execute, using the calculated combination of t and q, a process of storing a plurality of the first matrices and a plurality of the second matrices of t rows and t columns in each of the q storage areas, and a process in which each of the q calculation cores performs the convolution calculation of the first matrices and the second matrices using the Winograd algorithm.
(Appendix 9) An information processing method in which a computer executes:
a process of calculating, among combinations of t and q for which the total number of elements of a plurality of first matrices and a plurality of second matrices of t rows and t columns does not exceed the number of data that can be stored in each of q storage areas among a plurality of storage areas of a register, the combination that minimizes the calculation time when each of q calculation cores corresponding to the respective q storage areas executes, in parallel, a convolution calculation of the plurality of first matrices and the plurality of second matrices by the Winograd algorithm; and
a process of outputting a program for causing a computer having the calculation cores and the register to execute, using the calculated combination of t and q, a process of storing a plurality of the first matrices and a plurality of the second matrices of t rows and t columns in each of the q storage areas, and a process in which each of the q calculation cores performs the convolution calculation of the first matrices and the second matrices using the Winograd algorithm.
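As a reminder of what each calculation core computes, the following sketch shows the well-known one-dimensional Winograd form F(2, 3), which produces two outputs of a 3-tap filter from a 4-element tile with four multiplications instead of six. It is a simplified illustration only: the notes above concern two-dimensional t-row, t-column tiles processed by q cores in parallel, and the function names `winograd_f23` and `direct_f23` are assumptions introduced here.

```python
from typing import List, Sequence


def winograd_f23(d: Sequence[float], g: Sequence[float]) -> List[float]:
    """Two outputs of a 3-tap filter g over a 4-element tile d (4 multiplications)."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d: Sequence[float], g: Sequence[float]) -> List[float]:
    """Reference result computed directly (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    tile = [1.0, 2.0, 3.0, 4.0]
    taps = [1.0, 2.0, 3.0]
    print(winograd_f23(tile, taps), direct_f23(tile, taps))
```

The filter-side factors such as (g[0] + g[1] + g[2]) / 2 depend only on the filter, so they can be computed once and reused for every tile.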

10 ... Computer, 11 ... Main memory, 12 ... Processor, 13 ... Bus, 20 ... Register file, 21 ... Information processing program, 31 ... Information processing device, 32 ... Storage device, 33 ... Main memory, 34 ... Processor, 35 ... Input device, 36 ... Display device, 37 ... Bus, 38 ... Recording medium, 39 ... Information processing program, 41 ... Output unit, 42 ... Calculation unit, 50 ... Program, 51 ... Reception unit, 52 ... Selection unit, 53 ... Storage unit, 54 ... Calculation unit, 55 ... Output unit.

Claims

1. An information processing device comprising:
a calculation unit that calculates, among combinations of t and q for which the total number of elements of a plurality of first matrices and a plurality of second matrices of t rows and t columns does not exceed the number of data that can be stored in each of q storage areas among a plurality of storage areas of a register, the combination that minimizes the calculation time when each of q calculation cores corresponding to the respective q storage areas executes, in parallel, a convolution calculation of the plurality of first matrices and the plurality of second matrices by the Winograd algorithm; and
an output unit that outputs a program for causing a computer having the calculation cores and the register to execute, using the calculated combination of t and q, a process of storing a plurality of the first matrices and a plurality of the second matrices of t rows and t columns in each of the q storage areas, and a process in which each of the q calculation cores performs the convolution calculation of the first matrices and the second matrices using the Winograd algorithm.

2. The information processing device according to claim 1, wherein each of the first matrices and the second matrices is a matrix in a convolutional layer of deep learning.

3. The information processing device according to claim 1, wherein each of the plurality of second matrices is identified by a combination of a first identifier and a second identifier, and
the program causes the computer to execute a process of storing, in each of the q storage areas, a respective one of q second matrices that have mutually different first identifiers and the same second identifier.

4. An information processing program that causes a computer to execute:
a process of calculating, among combinations of t and q for which the total number of elements of a plurality of first matrices and a plurality of second matrices of t rows and t columns does not exceed the number of data that can be stored in each of q storage areas among a plurality of storage areas of a register, the combination that minimizes the calculation time when each of q calculation cores corresponding to the respective q storage areas executes, in parallel, a convolution calculation of the plurality of first matrices and the plurality of second matrices by the Winograd algorithm; and
a process of outputting a program for causing a computer having the calculation cores and the register to execute, using the calculated combination of t and q, a process of storing a plurality of the first matrices and a plurality of the second matrices of t rows and t columns in each of the q storage areas, and a process in which each of the q calculation cores performs the convolution calculation of the first matrices and the second matrices using the Winograd algorithm.

5. An information processing method in which a computer executes:
a process of calculating, among combinations of t and q for which the total number of elements of a plurality of first matrices and a plurality of second matrices of t rows and t columns does not exceed the number of data that can be stored in each of q storage areas among a plurality of storage areas of a register, the combination that minimizes the calculation time when each of q calculation cores corresponding to the respective q storage areas executes, in parallel, a convolution calculation of the plurality of first matrices and the plurality of second matrices by the Winograd algorithm; and
a process of outputting a program for causing a computer having the calculation cores and the register to execute, using the calculated combination of t and q, a process of storing a plurality of the first matrices and a plurality of the second matrices of t rows and t columns in each of the q storage areas, and a process in which each of the q calculation cores performs the convolution calculation of the first matrices and the second matrices using the Winograd algorithm.