JP7420100B2 - Processing device, processing method, and program

Info

Publication number: JP7420100B2
Application number: JP2021041201A
Authority: JP
Inventors: 龍樹澤田
Original assignee: Omron Corp
Current assignee: Omron Corp
Priority date: 2021-03-15
Filing date: 2021-03-15
Publication date: 2024-01-23
Anticipated expiration: 2041-03-15
Also published as: JP2022141064A

Description

The present invention relates to a processing device, a processing method, and a program for use in a convolutional neural network.

Conventionally, in the field of machine learning, images and videos have been recognized using a model called a convolutional neural network (CNN). In image recognition, for example, convolution layers and pooling layers transform the input image while gradually reducing the amount of data, and the network finally outputs a probability value for each category.

In a convolution layer of a CNN, a filter is applied to each local region of the input data (for example, a region of 3×3 cells); this is called filter processing or convolution processing. In convolution processing, the data of the same cell of the input data may be read from the main memory (RAM) many times (for example, as many times as the number of cells (coefficients) of the filter). Moreover, tens to hundreds of filters are used in a single convolution layer. The number of data reads from the main memory is therefore extremely large, and this has been a bottleneck preventing faster CNN processing.

In Patent Document 1, input data spanning a plurality of channels is rearranged in order to implement a CNN convolution layer efficiently.

Patent Document 1: International Publication No. 2018/067603

However, Patent Document 1 simply divides and rearranges the input data. As a result, for cells located at a division boundary in the original input data, for example, the adjacency relationships change greatly after the rearrangement. Filter processing on such cells must therefore refer to cells at various positions in the rearranged data. Reading the rearranged data from the main memory into registers consequently requires many operations, and efficient convolution layer processing could not be achieved.

Accordingly, an object of the present invention is to realize the processing of a convolution layer of a convolutional neural network efficiently.

To achieve the above object, the present invention adopts the following configurations.

That is, a processing device according to one aspect of the present invention performs convolution processing on input data of a convolution layer of a convolutional neural network, the input data being two-dimensional data having data in each cell. The processing device comprises: arrangement means for selecting a plurality of regions from the input data, concatenating the data of the plurality of regions, and arranging the concatenated data at consecutive addresses in a main memory as arrangement data; and processing means for reading the arrangement data with a processor and performing convolution processing using a filter on the arrangement data. The size of each of the plurality of regions in the row direction corresponds to an integer multiple of the number of cells that the processor can read at once, and each of the plurality of regions has data that overlaps the columns of another region by a number of columns equal to the filter size minus 1.

Because the size of each region in the row direction is an integer multiple of the number of cells that the processor can read at once (the word size), a processor that can read data from a predetermined number of cells at once is prevented from reading data from fewer than that number of cells in the target region. In other words, the read capability of the processor can be used to the fullest. In addition, because the arrangement data contains overlapping regions, the other cells needed for filter processing are gathered around the cell being filtered. The processor can therefore execute the filter processing of the convolution layer with fewer reads from the main memory, so the processing of the convolution layer of the convolutional neural network can be realized efficiently. Here, the filter size is n cells for a filter of n cells × n cells.

In the above processing device, the arrangement means may concatenate the data of the plurality of regions in the column direction in the main memory and arrange the result as the arrangement data. The cells (data) needed to execute filter processing for a region are then gathered within that one region. The processor can therefore execute the filter processing of the convolution layer with fewer reads from the main memory, so the processing of the convolution layer of the convolutional neural network can be realized efficiently.

In the above processing device, when arranging the arrangement data, the arrangement means may place a row containing the filter data between two adjacent regions among the plurality of regions. This removes the need, for example, for the registers of the processor to hold filter data for all channels at all times, so the registers can be used more effectively for filter processing.

In the above processing device, each of the plurality of regions may further have data that overlaps the rows of another region by a number of rows equal to the filter size minus 1. With overlapping rows, filter processing can be executed without referring to any other region. The convolution layer can therefore be processed without re-reading rows of the arrangement data that have already been read, which makes the processing more efficient.

In the above processing device, the first memory address of each row of the arrangement data may be an integer multiple of the number of memory addresses that the processor can read at once. The first memory address of each row of the arrangement data can then be aligned with the first memory address of a memory block of the main memory. This suppresses accesses to unneeded memory blocks and makes reads from the main memory by the processor more efficient.

In the above processing device, the input data may span a plurality of channels, and the arrangement means may determine a priority order for the plurality of regions in the order of column, row, and channel, concatenate the data of the plurality of regions in the column direction in the main memory according to the priority order, and arrange the result as the arrangement data. With this arrangement, filter processing can be executed consecutively on regions that share the same column and row and differ only in channel. While the result of filter processing on one channel is held in a register as an intermediate result, filter processing on another channel can be executed using other registers. In other words, writing the intermediate result of a channel to the main memory and reading it back can be suppressed, so convolution layer processing, which requires summing the intermediate results of all channels, can be executed efficiently.

In the above processing device, when performing the convolution processing, the processing means may read from the arrangement data the data blocks of a first plurality of rows, namely as many rows as the filter size that are consecutive in the column direction in the arrangement data, and store them in registers. It may then perform, on the data block of each row of the first plurality of rows, a shift in the row direction and a multiplication by the value of one cell of the filter. An arithmetic operation is thus applied at once to the multiple data items held in one data block, so filter processing is in effect executed on a plurality of local regions at once. The processing of the convolution layer of the convolutional neural network can therefore be realized efficiently.

In the above processing device, when performing the convolution processing, the processing means may proceed as follows. For each row of the first plurality of rows, it obtains the data block of that row together with the data blocks obtained by shifting it in the row direction by 1 cell, 2 cells, and so on up to the filter size minus 1 cells. It multiplies each obtained data block by the value of the filter cell that corresponds to the row of the data block and to the amount by which the data block was shifted, thereby obtaining the data blocks of a second plurality of rows. It then sums the values of the cells at corresponding positions across the data blocks of the second plurality of rows.
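The shift, multiply, and sum procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: a register-sized data block is modelled as a plain list, the shift is taken to be toward lower column indices with zero fill, and the function name `filter_block` is hypothetical.

```python
def filter_block(rows, kernel):
    """Apply a k x k filter to k consecutive rows of data blocks.

    rows   : k lists of equal length n (one data block per row).
    kernel : k x k list of filter cell values.
    Returns the n - (k - 1) outputs whose k x k window lies fully
    inside the block; the remaining positions would need cells from
    the neighbouring block, which is why regions overlap by k - 1 columns.
    """
    k = len(kernel)
    n = len(rows[0])
    acc = [0] * n
    for r in range(k):          # row of the first plurality of rows
        for s in range(k):      # shift amount: 0 .. k - 1 cells
            # Shift the block by s cells (assumed direction), zero-filling.
            shifted = list(rows[r][s:]) + [0] * s
            # Multiply the whole block by one filter cell and accumulate.
            for i in range(n):
                acc[i] += shifted[i] * kernel[r][s]
    return acc[: n - (k - 1)]


rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(filter_block(rows, ones))  # [54, 63]
```

Each pass of the inner loop corresponds to one block-wide multiplication, so a real SIMD implementation would replace the loop over `i` with a single vector instruction.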

The present invention may be regarded as a device having at least some of the above means, or as an electronic apparatus, a control system, an information processing system, an information processing device, a processing system, or a data arrangement device. The present invention may also be regarded as a control method, a processing method, or an arrangement method that includes at least part of the above processing, or as a program for realizing such a method or a recording medium (storage medium) on which the program is recorded non-transitorily. The above means and processes may be combined with one another wherever possible to constitute the present invention.

According to the present invention, the processing of a convolution layer of a convolutional neural network can be realized efficiently.

FIG. 1A is a simplified configuration diagram of the processing device, and FIG. 1B is an internal configuration diagram of the processor.
FIG. 2 is a diagram illustrating the processing of a convolution layer.
FIG. 3 is a flowchart of the data arrangement processing.
FIG. 4A is a diagram showing input data, and FIG. 4B is a diagram showing arrangement data.
FIG. 5 is a diagram illustrating the priority order of selection regions.
FIG. 6 is a diagram illustrating filter processing when the data arrangement processing is not performed.
FIG. 7 is a diagram illustrating filter processing.

Embodiments for carrying out the present invention are described below with reference to the drawings.

First, a convolutional neural network (CNN) will be described. A CNN includes convolution layers and pooling layers. The amount of data of an input such as an image decreases as the set of processes in a convolution layer and a pooling layer is executed repeatedly. Finally, for example, the probability that the image shows a predetermined object (for example, a person, a face, or an animal such as a dog) is output as output data.

In the convolution layer, as shown in FIG. 2 for example, a matrix of pixel values arranged two-dimensionally is input as input data for each of the R, G, and B channels of the image. The convolution layer then applies filters 211 to 213, called kernels, to local regions 201 to 203 (for example, 3×3 regions) of the respective channels of the input image. In general, therefore, at least as many filters as channels are needed. The value obtained by applying filter 211 to local region 201, the value obtained by applying filter 212 to local region 202, and the value obtained by applying filter 213 to local region 203 are then summed to obtain output 220 (output value V22) corresponding to the local region (the cell at its center). Here, the result of applying filter 211 to local region 201, for example, is calculated as in Equation 1, by multiplying the values at corresponding positions of the 3×3 regions of local region 201 and filter 211 and summing the products. Hereinafter, the process of generating, from the input data (local regions), output data that contains outputs such as output 220 is called "convolution processing," and the part of convolution processing that applies a filter to a local region is called "filter processing."
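The body of Equation 1 is not reproduced in this text. As a sketch of what the description implies, with symbols that are assumptions rather than the patent's own notation, let $x^{(c)}_{ij}$ be the value at row $i$, column $j$ of the 3×3 local region of channel $c$, and $w^{(c)}_{ij}$ the value of the corresponding filter cell. The filter processing result for one channel and the output value then take the form:

```latex
F^{(c)} = \sum_{i=1}^{3} \sum_{j=1}^{3} x^{(c)}_{ij} \, w^{(c)}_{ij},
\qquad
V_{22} = \sum_{c \in \{R,\,G,\,B\}} F^{(c)}
```

Here $F^{(R)}$ corresponds to applying filter 211 to local region 201, and $V_{22}$ to output 220.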

The pooling layer processes the matrix produced by the convolution layer and outputs information for each local region. For example, the maximum or the average value in a 2×2 region is output as the output value corresponding to that local region.
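As a small illustration of the pooling described above, the following Python sketch reduces each 2×2 region to one value. It assumes non-overlapping 2×2 regions (a stride of 2), which the text does not specify, and the function name `pool2x2` is hypothetical.

```python
def pool2x2(matrix, reduce=max):
    """Reduce each non-overlapping 2x2 region of a 2-D list to one value."""
    out = []
    for r in range(0, len(matrix) - 1, 2):
        row = []
        for c in range(0, len(matrix[r]) - 1, 2):
            # Collect the four cells of the 2x2 local region.
            window = [matrix[r][c], matrix[r][c + 1],
                      matrix[r + 1][c], matrix[r + 1][c + 1]]
            row.append(reduce(window))
        out.append(row)
    return out


m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(pool2x2(m))                                # [[6, 8], [14, 16]]
print(pool2x2(m, lambda v: sum(v) / len(v)))     # [[3.5, 5.5], [11.5, 13.5]]
```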

<Application example>
The following describes a processing device 1 that performs convolution processing (filter processing) on input data of a convolution layer of a convolutional neural network, the input data being two-dimensional data having data in each cell. The processing device 1 selects a plurality of regions (selection regions) from the input data, concatenates the data of the selection regions, and arranges the result at consecutive addresses in the main memory as arrangement data. The processing device 1 then reads the arrangement data with the processor and performs the convolution processing. In doing so, the processing device 1 sets the size of each selection region in the row direction to a size corresponding to an integer multiple of the number of cells that the processor can read at once (the word size). The processing device 1 also selects the selection regions so that each of them has data (an overlapping region) that overlaps the columns of another selection region by a number of columns equal to the filter size minus 1.

Because the size of each selection region in the row direction is an integer multiple of the word size, a processor that can read data from a predetermined number of cells at once is prevented from reading data from fewer than that number of cells in the target selection region. In other words, the read capability of the processor can be used to the fullest. In addition, because the arrangement data contains overlapping regions, the other cells needed for filter processing are gathered around the cell being filtered. The processor can therefore execute the convolution processing with fewer reads from the main memory, so the processing of the convolution layer of the convolutional neural network can be realized efficiently.

<Embodiment 1>
The configuration of the processing device 1 according to Embodiment 1 is described below with reference to FIGS. 1A and 1B. FIG. 1A is a simplified configuration diagram of the processing device 1. The processing device 1 may be any processing device (processing terminal) such as a PC, a server, or a smartphone. The processing device 1 includes a processor 10, a storage device 20, an input/output device 30, and a bus 40.

The processor 10 (CPU; central processing unit) controls each component (device) of the processing device 1. For example, it performs control using data stored in the storage device 20 in accordance with user instructions input to the input/output device 30. The processor 10 has a plurality of registers 11 (typically 16 or 32 registers 11).

Each of the registers 11 is a storage circuit that temporarily holds data and can be read and written faster than the storage device 20 (main memory 21). The processor 10 therefore executes its various processes while temporarily holding data in the registers 11. The registers 11 include a plurality of dedicated registers whose use (such as arithmetic) is fixed and a plurality of general-purpose registers whose use is not fixed. Each register 11 can hold, for example, eight 16-bit data items, so each register 11 can hold the data of eight cells of the input data at once.

The storage device 20 stores (records) data that the processor 10 uses for processing. The storage device 20 includes a hard disk, a RAM (random access memory; the main memory 21), and a ROM (read-only memory) that stores data non-transitorily. The ROM stores, for example, the BIOS used to start the OS (operating system) and programs with which the processor 10 operates.

The main memory 21 temporarily stores data such as the input data and the filters. In the main memory 21, memory cells each storing 1 bit of information are arranged two-dimensionally (in the row and column directions). The main memory 21 may be removable from the processing device 1, in which case the processing device 1 can be regarded as a memory control device. As described above, as many filters as there are channels of the input data (or more) are needed, so the main memory 21 stores filter data for the channels of the input data.

The main memory 21 stores data in units of predetermined blocks (hereinafter, memory blocks). For example, if the size of one memory block is 4 bytes, the first memory address of each memory block is an integer multiple of 4 bytes. The processor 10, on the other hand, reads data from the main memory 21 at once in units of words (the number of cells or bits that the processor 10 can read at once). If the data that the processor 10 requests spans two memory blocks of the main memory 21, the processor 10 must therefore access the memory twice. Reducing the number of memory blocks touched by each access thus leads to faster memory access. Note that the processor 10 can read data at once from a plurality of cells in one row, but not from a plurality of cells in one column.
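The cost of an unaligned read can be illustrated with a short Python calculation. The helper name `blocks_touched` and the byte-addressed model are assumptions made for illustration only.

```python
def blocks_touched(address, length, block_size):
    """Number of memory blocks spanned by a read of `length` bytes at `address`."""
    first = address // block_size
    last = (address + length - 1) // block_size
    return last - first + 1


# With 4-byte memory blocks, an aligned 4-byte read touches one block,
# while the same read starting at address 2 spans two blocks.
print(blocks_touched(0, 4, 4))  # 1
print(blocks_touched(2, 4, 4))  # 2
```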

The input/output device 30 includes an input device that receives instructions (operations) from a user and an output device that outputs data processed by the processor 10. The input device includes, for example, a mouse, a keyboard, a touch panel, buttons, a dial, a microphone (voice input device), an attitude detection device (gyro sensor), and a temperature sensor. The output device includes, for example, a display (display device), a speaker (audio output device), and a printer.

The bus 40 is a path for communication among the processor 10, the storage device 20, and the input/output device 30. Via the bus 40, the processor 10 can acquire data from the storage device 20, execute processing using the data, and store the processed data in the storage device 20.

(Internal configuration of processor)
Next, the internal configuration of the processor 10 is described with reference to FIG. 1B. FIG. 1B is an internal configuration diagram of the processor 10 as used for rearranging the input data, convolution processing (filter processing), and the like. The processor 10 includes an acquisition unit 101, an arrangement unit 102, and a processing unit 103. These units may be realized by dedicated logic circuits, or in software by the processor 10 executing a program.

The acquisition unit 101 acquires the input data of the CNN (see FIG. 2) from the main memory 21 or from a pooling layer. Here, the input data has, for example, data spanning a plurality of channels (for example, channel R, channel G, and channel B, each indicating one of the RGB pixel values). In each channel, cells that each hold data are arranged two-dimensionally. The input data need not have a plurality of channels and may have only a single channel.

The arrangement unit 102 selects a plurality of regions of a predetermined size in the input data as selection regions. The arrangement unit 102 then arranges the data of the selection regions so that they are concatenated in a predetermined order (priority order) and stores the result in the main memory 21 as arrangement data. Arrangement data in which the input data has been rearranged is thus stored in the main memory 21. The processing unit 103 then reads the arrangement data stored in the main memory 21 into the registers 11 and executes the convolution processing (filter processing) of the convolution layer.

[Data arrangement processing]
The data arrangement processing described above is explained in detail below using the flowchart of FIG. 3. The processing of the flowchart of FIG. 3 is realized by the processor 10 executing a program stored in the ROM.

In step S1001, the acquisition unit 101 acquires the input data used in the convolution layer of the CNN.

In step S1002, the arrangement unit 102 selects a plurality of regions from the input data as selection regions. The arrangement unit 102 selects the selection regions so that the size of each in the row direction is an integer multiple of the size of the above-mentioned word (the number of cells that the processor 10 can read at once). The arrangement unit 102 also selects the selection regions so that each has data that overlaps the columns of another selection region by a number of columns equal to the filter size minus 1. Here, the filter size is 3 cells for a filter having 3×3 cells. Each selection region may also have data that overlaps the rows of another selection region by a number of rows equal to the filter size minus 1.
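A minimal Python sketch of the column ranges that step S1002 would produce is shown below. It rests on assumptions not stated in the source: regions are taken left to right, each is exactly `words_per_region` words wide, and the last region may extend past the input width (which would require padding). The function and parameter names are hypothetical.

```python
def select_column_ranges(width, word_cells, filter_size, words_per_region=1):
    """Return (start, end) column ranges, end exclusive, for selection regions.

    Each range is a multiple of the word size wide, and consecutive
    ranges overlap by filter_size - 1 columns.
    """
    region_width = word_cells * words_per_region
    overlap = filter_size - 1
    if region_width <= overlap:
        raise ValueError("region must be wider than the overlap")
    step = region_width - overlap
    ranges = []
    start = 0
    while True:
        ranges.append((start, start + region_width))
        if start + region_width >= width:
            break  # the input width is covered (possibly with padding)
        start += step
    return ranges


# 20 columns, 8 cells per word, 3x3 filter: 2 columns are shared by neighbours.
print(select_column_ranges(20, 8, 3))  # [(0, 8), (6, 14), (12, 20)]
```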

In step S1003, the arrangement unit 102 determines, for all of the selected selection regions, a priority order representing the order in which they are arranged. Specifically, the arrangement unit 102 determines the priority order over all selection regions in the order of column, row, and channel.

For example, assume that the arrangement unit 102 has selected the selection regions CH_R11, CH_R21, CH_R31, CH_G11, CH_G21, CH_G31, CH_B11, CH_B21, and CH_B31 shown in FIG. 4A from the input data. In this case, because rows take priority over channels, the arrangement unit 102 determines the priority order to be CH_R11, CH_G11, CH_B11, CH_R21, CH_G21, CH_B21, CH_R31, CH_G31, CH_B31 (see FIG. 4B).

As another example, assume that the selection regions CH_R11, CH_R12, CH_R13, CH_R21, CH_R22, CH_R23, CH_R31, CH_R32, and CH_R33 shown in FIG. 5 have been selected from the input data. In this case, because columns take priority over rows, the arrangement unit 102 determines the priority order to be CH_R11, CH_R21, CH_R31, CH_R12, CH_R22, CH_R32, CH_R13, CH_R23, CH_R33.
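The two orders listed above can be reproduced by treating column as the outermost sort key, row as the next, and channel as the innermost. The Python sketch below assumes that the first digit of each label is the row index and the second is the column index; that reading is inferred from the listed orders and is not stated in the source. The function name is hypothetical.

```python
def priority_order(rows, cols, channels):
    """List region labels with column as the outermost key and channel innermost."""
    return [f"CH_{ch}{r}{c}"
            for c in cols          # column: highest priority
            for r in rows          # then row
            for ch in channels]    # channel varies fastest


# The FIG. 4A example: one column position, three row positions, three channels.
print(priority_order([1, 2, 3], [1], ["R", "G", "B"]))
# The FIG. 5 example: a single channel, 3 x 3 region positions.
print(priority_order([1, 2, 3], [1, 2, 3], ["R"]))
```

Because channel varies fastest, regions that differ only in channel end up adjacent in the arrangement data, which is what allows per-channel intermediate results to stay in registers.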

In step S1004, the arrangement unit 102 generates arrangement data by concatenating the data of the selection regions in accordance with the priority order. For example, when the selection regions have been selected from the input data as shown in FIG. 4A, the arrangement unit 102 places the selection region CH_R11 first and then concatenates the selection regions CH_G11, CH_B11, CH_R21, and so on in the column direction in accordance with the priority order. By arranging the selected selection regions in this way, the arrangement unit 102 generates the arrangement data.
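Concatenation in the column direction amounts to stacking the rows of each selection region one below another. The following is a minimal sketch in which regions are plain lists of rows; the function name `concatenate_regions` and the tiny 2×2 regions are illustrative assumptions, not the layout used in the figures.

```python
def concatenate_regions(regions, priority_order):
    """Stack the rows of each region vertically, following the priority order."""
    arrangement = []
    for name in priority_order:
        for row in regions[name]:
            arrangement.append(list(row))   # copy so the input is not aliased
    return arrangement


# Hypothetical 2x2 regions, one per channel, at the same position.
regions = {
    "CH_R11": [[1, 2], [3, 4]],
    "CH_G11": [[5, 6], [7, 8]],
    "CH_B11": [[9, 10], [11, 12]],
}
arrangement = concatenate_regions(regions, ["CH_R11", "CH_G11", "CH_B11"])
```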

In step S1005, the arrangement unit 102 stores the arrangement data in the main memory 21. The arrangement unit 102 preferably places the arrangement data so that the first memory address of each row of the arrangement data is an integer multiple of the number of memory addresses that the processor 10 can read at once. The number of memory addresses that the processor 10 can read at once is generally equal to, or an integer multiple of, the number of memory addresses contained in one memory block of the main memory 21. The first memory address of each row of the arrangement data can therefore be made to coincide with the first memory address of a memory block of the main memory 21. Consequently, when the processor 10 reads a word-size run of cells from the main memory 21 at once, fewer memory blocks need to be read, and the number of memory accesses can be reduced.
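The alignment condition can be illustrated with simple address arithmetic. In this sketch, addresses are counted in the same unit as `read_width` (the number of addresses read at once), and the helper names `align_up` and `row_start_addresses` are hypothetical.

```python
def align_up(address, alignment):
    """Round address up to the next multiple of alignment."""
    return (address + alignment - 1) // alignment * alignment


def row_start_addresses(base_address, row_length, num_rows, read_width):
    """First address of every row when each row starts on a read_width boundary."""
    first = align_up(base_address, read_width)
    pitch = align_up(row_length, read_width)   # distance between consecutive rows
    return [first + i * pitch for i in range(num_rows)]


addresses = row_start_addresses(base_address=5, row_length=8, num_rows=3, read_width=8)
```

When the row length is already a multiple of `read_width`, as the embodiment requires for the selection regions, the pitch equals the row length and no padding is needed between rows.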

If, as in this embodiment, the size of each selection region in the row direction is an integer multiple of the word size, then when the processor 10 reads one row of one selection region it does not need to read the row one word at a time and then separately read only the cells left over. Unnecessary reads of the main memory by the processor 10 can therefore be reduced, which makes the processing in the convolution layer more efficient.
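The leftover read that this condition avoids is just the remainder of a division. A small sketch, with the hypothetical helper `reads_for_row`:

```python
def reads_for_row(row_width, word_size):
    """Split one row into full-word reads and leftover cells."""
    full_reads, leftover_cells = divmod(row_width, word_size)
    return full_reads, leftover_cells


aligned = reads_for_row(row_width=16, word_size=8)     # width is a multiple of the word size
unaligned = reads_for_row(row_width=19, word_size=8)   # width is not a multiple
```

A row of 16 cells is covered by two full reads, whereas a row of 19 cells leaves three cells that need separate handling.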

As described above, the processor 10 can read data at once from a plurality of cells arranged in one row, but cannot read data at once from a plurality of cells arranged in one column. The size of each selection region in the column direction may therefore be any size larger than the filter size. When the number of registers 11 is sufficiently large, increasing the column-direction size of each selection region reduces how often the selection region being filtered changes. This reduces the work of loading data into the registers 11 and releasing data from the registers 11. When the number of registers 11 is small, on the other hand, reducing the column-direction size of each selection region leaves enough room to keep the result of the filter processing for one selection region (the intermediate result) in the registers 11. This suppresses the need to write the intermediate result temporarily to the main memory 21 and read it back later. The processor 10 therefore preferably determines the column-direction size of each selection region based on, for example, the number of registers 11. Alternatively, the processor 10 may run the convolution-layer processing experimentally with various column-direction sizes and adopt the size that yields the fastest processing as the final column-direction size of each selection region.
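The experimental approach can be sketched as a small timing sweep. Everything here is an assumption for illustration: `run` stands for a caller-supplied function that executes the convolution-layer processing with a given column-direction size, and `measure` can be replaced so that the selection logic can be exercised without real timing.

```python
import time


def measure_seconds(run, height):
    """Wall-clock time of one run with the given column-direction size."""
    start = time.perf_counter()
    run(height)
    return time.perf_counter() - start


def choose_region_height(candidate_heights, run, measure=measure_seconds):
    """Return the candidate size with the smallest measured cost."""
    timings = {height: measure(run, height) for height in candidate_heights}
    return min(timings, key=timings.get)
```

In practice each candidate would probably be timed several times and the candidates restricted to sizes larger than the filter size, as the description requires.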

In this embodiment, each selection region also shares data with another selection region over a number of columns equal to the filter size minus 1. Each selection region may additionally share data with another selection region over a number of rows equal to the filter size minus 1. If data were not duplicated between selection regions, filtering a cell at the edge of a selection region, for example, would also require the data of the cells surrounding that edge cell in the input data. In that case, the processor 10 would have to read the data of those surrounding cells from another selection region (another memory block of the main memory 21). When the data is duplicated as in this embodiment, by contrast, the filter processing can be carried out using the data of the cells in the overlapping portion, so the processor 10 does not need to read data from cells of another selection region. Unnecessary reads of the main memory 21 by the processor 10 can therefore be reduced, which makes the convolution processing (the processing in the convolution layer) more efficient.
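Why an overlap of exactly the filter size minus 1 is sufficient can be checked directly: every horizontal filter window then lies entirely inside one selection region. The sketch below uses assumed sizes (an input 20 columns wide, regions 8 columns wide, a filter 3 cells wide) and hypothetical helper names.

```python
def column_ranges(input_width, region_width, filter_size):
    """Column ranges of regions that overlap by filter_size - 1 columns."""
    stride = region_width - (filter_size - 1)
    return [(start, start + region_width)
            for start in range(0, input_width - region_width + 1, stride)]


def regions_containing_window(ranges, window_start, filter_size):
    """Indices of the regions that fully contain the window."""
    return [index for index, (lo, hi) in enumerate(ranges)
            if lo <= window_start and window_start + filter_size <= hi]


ranges = column_ranges(input_width=20, region_width=8, filter_size=3)
# One entry per possible window position along a row.
owners = [regions_containing_window(ranges, start, 3) for start in range(20 - 3 + 1)]
```

For these sizes each window is contained in exactly one region, so no window needs cells from a neighbouring region and no output is computed twice.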

[Effect of arranging according to the priority order]
The following describes the effect of determining the priority order over all selection regions in the order of column, row, and channel, and arranging the selection regions in accordance with that priority order.

(When not arranged according to the priority order)
First, an example is described of how the processor 10 reads data from the main memory 21 when the data is not arranged according to the priority order. When the data arrangement processing is not performed, the convolution-layer processing is applied to input data spanning a plurality of channels as shown in FIG. 4A. In this case, the processor 10 stores data in the plurality of registers 11 in order, starting for example from the first column of the first row of channel R, and executes the filter processing based on the data stored in the registers 11. If the filter size is 3×3, for example, the processor 10 stores three rows of word-size blocks BLK1 to BLK3 (data blocks) in the registers 11 in order from the upper left, as shown in FIG. 6. The processor 10 can then calculate output values of the filter processing from the three blocks and the filter. Next, the processor 10 moves to the next column to the right of blocks BLK1 to BLK3, stores the three blocks BLK4 to BLK6 in the registers 11, and calculates output values by applying the filter. To filter a local region that straddles blocks BLK1 to BLK3 and blocks BLK4 to BLK6, the processor 10 stores all six blocks BLK1 to BLK6 in the registers 11 and uses their data. In other words, filtering a single local region may require six registers (twice the filter size) to hold the data of the fetched blocks.

Such processing is repeated, and when the filter processing using the data of the first three rows is finished, the same processing is performed on the data of the second to fourth rows. Because the number of registers 11 is limited, the data of the blocks used in the earlier filter processing will by then have been discarded from the registers 11. Consequently, when the processor 10 attempts to filter a local region centered on a cell of block BLK3 in the third row, it has to read blocks BLK2 and BLK3 again. When the data is not arranged according to the priority order of this embodiment, the data of a single cell may therefore be read from the main memory 21 as many times as the filter size.
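The repeated reads can be counted with a short model. It assumes the worst case described above, namely that no row survives in the registers from one pass of the filter to the next; the function name `row_load_counts` is hypothetical.

```python
def row_load_counts(num_rows, filter_size):
    """How often each input row is loaded when every pass reloads its rows."""
    counts = [0] * num_rows
    # One pass per vertical position of the filter.
    for top in range(num_rows - filter_size + 1):
        for row in range(top, top + filter_size):
            counts[row] += 1
    return counts


counts = row_load_counts(num_rows=5, filter_size=3)
```

Rows away from the top and bottom edges are loaded once per filter row, which is the "as many times as the filter size" figure mentioned in the text.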

(When arranged according to the priority order)
Next, the processing by which the processor 10 reads data from the main memory 21 when the data arrangement processing according to this embodiment is performed is described. In this embodiment, the input data shown in FIG. 4A is rearranged into the arrangement data shown in FIG. 4B, and the processor 10 executes the filter processing using that arrangement data.

Assume here, for example, that the row size of each selection region equals the word size. In this case, the processor 10 reads the arrangement data one row at a time, consecutively in the column direction (in row order), and stores each row in a register 11. It then executes the filter processing using the data stored in the registers 11. Assume also that each selection region shares data with other selection regions over a number of rows and columns equal to the filter size minus 1, in the column and row directions respectively. If the processor 10 then reads the arrangement data in row order, it never has to read again, for use in the filter processing, a row that it has already read from the arrangement data. In other words, the reads by the processor 10 into the registers 11 can be made sequential (simplified).
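Within one selection region, this sequential access pattern behaves like a rolling window over the rows: each row is loaded once and the most recent rows stay available for filtering. The sketch below models the registers 11 with a bounded `deque`; the function name `filter_windows` and the string placeholders for rows are illustrative assumptions.

```python
from collections import deque


def filter_windows(region_rows, filter_size):
    """Read rows once, in order, and yield every filter_size-row window.

    Returns the list of windows and the number of row loads performed.
    """
    loads = 0
    window = deque(maxlen=filter_size)   # oldest row is dropped automatically
    windows = []
    for row in region_rows:
        window.append(row)               # one load per row, never repeated
        loads += 1
        if len(window) == filter_size:
            windows.append(tuple(window))
    return windows, loads


windows, loads = filter_windows(["r0", "r1", "r2", "r3", "r4"], filter_size=3)
```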

As explained with reference to FIG. 2, the output value of the convolution layer is calculated by summing the results (intermediate results) of the filter processing performed on each of the plurality of channels. To suit this, the arrangement data is arranged so that the selection regions at corresponding positions (corresponding columns and rows) in the plurality of channels are concatenated consecutively. The processor 10 can thus filter, for example, a channel R selection region, the corresponding channel G selection region, and the corresponding channel B selection region one after another. This raises the likelihood that the channel G and channel B selection regions can be filtered while the result of filtering the channel R selection region is still held in the registers 11. Therefore, for example, the following processing becomes unnecessary: the processor 10 storing in the main memory 21 the result of filtering the channel R selection region, filtering the channel G and channel B selection regions, and then reading the channel R result back from the main memory 21. The processing in the convolution layer can thus be made more efficient.
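The per-channel accumulation can be sketched with a running sum that is updated as each channel's selection region is filtered. The sketch uses plain lists and a filter applied without flipping, as is usual in CNN implementations; the function names and the sample values are assumptions for illustration.

```python
def filter_region(region, kernel):
    """Apply a k x k filter to one region (valid positions only, no flipping)."""
    k = len(kernel)
    out_height = len(region) - k + 1
    out_width = len(region[0]) - k + 1
    return [[sum(region[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_width)]
            for r in range(out_height)]


def convolve_channels(regions, kernels):
    """Filter corresponding regions of each channel and keep a running sum."""
    accumulator = None
    for region, kernel in zip(regions, kernels):
        partial = filter_region(region, kernel)
        if accumulator is None:
            accumulator = partial
        else:
            # The previous channels' sum stays live while this channel is added.
            accumulator = [[a + p for a, p in zip(acc_row, par_row)]
                           for acc_row, par_row in zip(accumulator, partial)]
    return accumulator


ones = [[1] * 3 for _ in range(3)]
channel_r = [[1] * 3 for _ in range(3)]
channel_g = [[2] * 3 for _ in range(3)]
channel_b = [[3] * 3 for _ in range(3)]
output = convolve_channels([channel_r, channel_g, channel_b], [ones, ones, ones])
```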

[Filter processing according to this embodiment]
The filter processing according to this embodiment is now described in detail with reference to FIG. 7. The description covers the filter processing for the local regions centered on each of a plurality of cells included in block 720 (the cells holding the values R22 to R27), using three rows of blocks 710, 720, and 730 (three rows of data blocks) that are consecutive in the column direction of a selection region, together with a filter 750. Each of blocks 710, 720, and 730 has eight cells, equal to the word size, and is temporarily stored in a register 11. The size of the filter 750 is three cells (3 cells × 3 cells).

First, the processing unit 103 of the processor 10 obtains, from block 710, block 710 itself, block 711 obtained by shifting block 710 to the left (in the row direction) by one cell, and block 712 obtained by shifting block 710 to the left by two cells. That is, the processing unit 103 obtains every block from the target block itself up to the block obtained by shifting the target block to the left by a number of cells equal to the filter size minus 1. The processing unit 103 then obtains block 715 by multiplying the value of every cell of block 710 at once by the value a of the cell in the first row and first column of the filter 750. Similarly, the processing unit 103 obtains block 716 by multiplying the value of every cell of block 711 at once by the value b of the cell in the first row and second column of the filter 750, and obtains block 717 by multiplying the value of every cell of block 712 at once by the value of the cell in the first row and third column of the filter 750. In other words, blocks 710 to 712, which correspond to the first of the three rows of blocks, are multiplied by the values of the cells in the first row of the filter 750. Each of blocks 710 to 712 is multiplied by the value of the filter cell in the column (position) that corresponds to the amount by which that block was shifted.

By applying the same processing to blocks 720 and 730, the processing unit 103 obtains blocks 720 to 722 and 730 to 732 shown in FIG. 7, which are blocks 720 and 730 and their shifted versions. The processing unit 103 also obtains blocks 725, 726, 727, 735, 736, and 737 by multiplying each of blocks 720 to 722 and 730 to 732 by the value of the corresponding cell of the filter 750, that is, the cell whose row corresponds to the block (row) among the three blocks 710, 720, and 730 from which the target block derives and whose column corresponds to the shift amount of the target block.

The processing unit 103 can then calculate the output values v22 to v27 for the local regions centered on the cells holding R22 to R27 by summing the values of the cells at the same position (column) in blocks 715 to 717, 725 to 727, and 735 to 737.
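The shift, multiply, and sum procedure can be sketched with lists standing in for the eight-cell registers. The helper names are hypothetical, and filling the cells vacated by a shift with zeros is an assumption; those trailing positions are discarded in any case, leaving six outputs for an eight-cell block and a 3×3 filter.

```python
def shift_left(block, cells):
    """Shift a block left by the given number of cells (vacated cells set to 0)."""
    return block[cells:] + [0] * cells


def scale(block, coefficient):
    """Multiply every cell of a block by one filter coefficient."""
    return [value * coefficient for value in block]


def filter_three_rows(rows, kernel):
    """Outputs for the local regions centered on the middle row's inner cells."""
    k = len(kernel)
    # One product block per filter cell: row r of the input, shifted by s cells,
    # times the filter value in row r, column s (nine blocks for a 3x3 filter).
    products = [scale(shift_left(rows[r], s), kernel[r][s])
                for r in range(k) for s in range(k)]
    # Sum the cells at the same position across all product blocks.
    sums = [sum(column) for column in zip(*products)]
    # The last k - 1 positions lack a full neighbourhood and are dropped.
    return sums[:len(rows[0]) - (k - 1)]


rows = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [11, 12, 13, 14, 15, 16, 17, 18],
    [21, 22, 23, 24, 25, 26, 27, 28],
]
kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
outputs = filter_three_rows(rows, kernel)
```

Each list comprehension over a whole block corresponds to what a single vector instruction would do on real hardware.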

The output values v22 to v27 need not be calculated in the order described above; an optimized order may be used. For example, the processing unit 103 may sum the values of the cells at the same position in blocks 715 to 717, store the sums in one register 11, then obtain blocks 720 to 722 from block 720, and then obtain blocks 725 to 727. In this way, at least some of the three registers 11 that held the values of blocks 715 to 717 can release (discard) their data. The filter processing can therefore be performed with fewer registers 11.
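One possible reordering keeps a single accumulator and folds each shifted, scaled block into it as soon as it is produced, so that the individual product blocks never need to be held at the same time. This is a sketch of that idea under the same zero-fill assumption for shifts, not necessarily the ordering used by the authors; the function name is hypothetical.

```python
def filter_rows_accumulating(rows, kernel):
    """Same result as summing all product blocks, using one running accumulator."""
    k = len(kernel)
    width = len(rows[0])
    accumulator = [0] * width
    for block, kernel_row in zip(rows, kernel):          # one input row at a time
        for shift, coefficient in enumerate(kernel_row):
            shifted = block[shift:] + [0] * shift        # temporary, reused each step
            accumulator = [a + coefficient * x
                           for a, x in zip(accumulator, shifted)]
    return accumulator[:width - (k - 1)]


rows = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [11, 12, 13, 14, 15, 16, 17, 18],
    [21, 22, 23, 24, 25, 26, 27, 28],
]
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
outputs = filter_rows_accumulating(rows, kernel)
```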

When the filter processing is performed in this way, it can be executed for a plurality of cells (six cells in this embodiment) at once. The number of operations can therefore be reduced substantially compared with executing the filter processing cell by cell. Such batch processing can be realized, for example, with a processing scheme called SIMD (Single Instruction, Multiple Data), in which a single instruction performs the same operation on multiple pieces of data.

In this embodiment, the size of each selection region in the row direction in the arrangement data in the main memory corresponds to an integer multiple of the number of cells that the processor can read at once (the word size). This prevents a processor that can read data from a predetermined number of cells at once from reading data from fewer than that number of cells of a selection region. In other words, the read capability of the processor can be used to the fullest.

Furthermore, each selection region shares data with another selection region over a number of columns (and rows) equal to the filter size minus 1. As a result, the other cells needed to filter a given cell are gathered around that cell. This lowers the need for unnecessary reads from the main memory, so the processor can execute the filter processing of the convolution layer while reducing the number of reads from the main memory.

The processing of the convolution layer of a convolutional neural network can therefore be realized efficiently.

In step S1004, when concatenating the data of two selection regions, the arrangement unit 102 may insert filter data between the two selection regions. That is, in the arrangement data, a row holding filter data may be placed between two adjacent selection regions. The filter data inserted is that of the filter corresponding to the selection region concatenated after it, so that the filter processing using that filter can be executed on that selection region. This removes the need, for example, for the plurality of registers 11 to hold the filter data for all channels at all times, so the registers 11 can be used more effectively for the filter processing.
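The interleaving of filter data can be sketched by emitting a filter row ahead of each selection region it applies to. Several details here are assumptions for illustration: rows are tagged tuples rather than raw memory, the filter coefficients are flattened into a single row, and the first region is also preceded by its filter even though the description only speaks of rows placed between two regions.

```python
def arrange_with_filters(regions, filters, priority_order):
    """Build arrangement rows with each region preceded by its filter data."""
    arrangement = []
    for name in priority_order:
        flat_filter = [value for kernel_row in filters[name] for value in kernel_row]
        arrangement.append(("filter", flat_filter))      # filter for the region that follows
        for row in regions[name]:
            arrangement.append(("data", list(row)))
    return arrangement


# Hypothetical 2-row regions and 2x2 filters for two channels.
regions = {"CH_R11": [[1, 2], [3, 4]], "CH_G11": [[5, 6], [7, 8]]}
filters = {"CH_R11": [[1, 0], [0, 1]], "CH_G11": [[0, 1], [1, 0]]}
arrangement = arrange_with_filters(regions, filters, ["CH_R11", "CH_G11"])
```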

The interpretation of the claims is not limited solely to the matters described in the embodiment. The interpretation of the claims also covers the scope described such that a person skilled in the art can recognize, in light of the common general technical knowledge at the time of filing, that the problem addressed by the invention can be solved.

(Supplementary Note 1)
A processing device (1) that performs convolution processing on input data of a convolution layer of a convolutional neural network, the input data being two-dimensional data having data in each cell, the processing device (1) comprising:
arrangement means (102) for selecting a plurality of regions from the input data, concatenating the data of the plurality of regions, and arranging the concatenated data at consecutive addresses in a main memory (21) as arrangement data; and
processing means (103) for reading the arrangement data with a processor (10) and performing convolution processing using a filter on the arrangement data,
wherein the size of each of the plurality of regions in the row direction corresponds to an integer multiple of the number of cells that the processor (10) can read at once, and
each of the plurality of regions has data that overlaps the columns of another region by a number of columns equal to the size of the filter minus 1.

(Supplementary Note 2)
A processing method executed by a processor (10) of a processing device (1) that performs convolution processing on input data of a convolution layer of a convolutional neural network, the input data being two-dimensional data having data in each cell, the processing method comprising:
an arrangement step (S1005) of selecting a plurality of regions from the input data, concatenating the data of the plurality of regions, and arranging the concatenated data at consecutive addresses in a main memory (21) as arrangement data; and
a processing step of reading the arrangement data with the processor (10) and performing convolution processing using a filter on the arrangement data,
wherein the size of each of the plurality of regions in the row direction corresponds to an integer multiple of the number of cells that the processor (10) can read at once, and
each of the plurality of regions has data that overlaps the columns of another region by a number of columns equal to the size of the filter minus 1.

(Supplementary Note 3)
A program for causing a processor (10) of a processing device (1) that performs convolution processing on input data of a convolution layer of a convolutional neural network, the input data being two-dimensional data having data in each cell, to execute a processing method,
the processing method comprising:
an arrangement step (S1005) of selecting a plurality of regions from the input data, concatenating the data of the plurality of regions, and arranging the concatenated data at consecutive addresses in a main memory (21) as arrangement data; and
a processing step of reading the arrangement data with the processor (10) and performing convolution processing using a filter on the arrangement data,
wherein the size of each of the plurality of regions in the row direction corresponds to an integer multiple of the number of cells that the processor (10) can read at once, and
each of the plurality of regions has data that overlaps the columns of another region by a number of columns equal to the size of the filter minus 1.

1: processing device, 10: processor, 11: register,
20: storage device, 21: main memory, 30: input/output device, 40: bus,
101: acquisition unit, 102: arrangement unit, 103: processing unit

Claims

A processing device that performs convolution processing on input data of a convolution layer of a convolutional neural network, the input data being two-dimensional data having data in each cell, the processing device comprising:
arrangement means for selecting a plurality of regions from the input data, concatenating the data of the plurality of regions, and arranging the concatenated data at consecutive addresses in a main memory as arrangement data; and
processing means for reading the arrangement data with a processor and performing convolution processing using a filter on the arrangement data,
wherein the size of each of the plurality of regions in the row direction corresponds to an integer multiple of the number of cells that the processor can read at once, and
each of the plurality of regions has data that overlaps the columns of another region by a number of columns equal to the size of the filter minus 1.

The processing device according to claim 1, wherein
the arrangement means concatenates the data of the plurality of regions in the column direction in the main memory and arranges the concatenated data as the arrangement data.

The processing device according to claim 2, wherein,
when arranging the arrangement data, the arrangement means places a row holding data of the filter between two adjacent regions of the plurality of regions.

The processing device according to any one of claims 1 to 3, wherein
each of the plurality of regions further has data that overlaps the rows of another region by a number of rows equal to the size of the filter minus 1.

The processing device according to any one of claims 1 to 4, wherein
the first memory address of each row of the arrangement data is a memory address that is an integer multiple of the number of memory addresses that the processor can read at once.

The processing device according to any one of claims 1 to 5, wherein
the input data is data spanning a plurality of channels, and
the arrangement means determines a priority order for the plurality of regions in the order of column, row, and channel, and concatenates the data of the plurality of regions in the column direction in the main memory in accordance with the priority order to arrange the data as the arrangement data.

The processing device according to any one of claims 1 to 6, wherein,
when performing the convolution processing, the processing means:
reads from the arrangement data, and stores in registers, data blocks of a first plurality of rows, the number of rows corresponding to the size of the filter and the rows being consecutive in the column direction in the arrangement data; and
performs, on the data block of each row of the first plurality of rows, processing of shifting the data block in the row direction and processing of multiplying the data block by the value of one cell of the filter.

The processing device according to claim 7, wherein,
when performing the convolution processing, the processing means:
obtains, for the data block of each row of the first plurality of rows, that data block itself and each data block obtained by shifting it in the row direction by from one cell up to a number of cells equal to the size of the filter minus 1;
obtains data blocks of a second plurality of rows by multiplying each obtained data block by the value of the cell of the filter that corresponds to the row, among the first plurality of rows, to which the data block corresponds and to the amount by which the data block has been shifted; and
sums the values of the cells at corresponding positions in the data blocks of the second plurality of rows.

A processing method executed by a processor of a processing device that performs convolution processing on input data of a convolution layer of a convolutional neural network, the input data being two-dimensional data having data in each cell, the processing method comprising:
an arrangement step of selecting a plurality of regions from the input data, concatenating the data of the plurality of regions, and arranging the concatenated data at consecutive addresses in a main memory as arrangement data; and
a processing step of reading the arrangement data with the processor and performing convolution processing using a filter on the arrangement data,
wherein the size of each of the plurality of regions in the row direction corresponds to an integer multiple of the number of cells that the processor can read at once, and
each of the plurality of regions has data that overlaps the columns of another region by a number of columns equal to the size of the filter minus 1.

A program for causing a processor of a processing device that performs convolution processing on input data of a convolution layer of a convolutional neural network, the input data being two-dimensional data having data in each cell, to execute a processing method,
the processing method comprising:
an arrangement step of selecting a plurality of regions from the input data, concatenating the data of the plurality of regions, and arranging the concatenated data at consecutive addresses in a main memory as arrangement data; and
a processing step of reading the arrangement data with the processor and performing convolution processing using a filter on the arrangement data,
wherein the size of each of the plurality of regions in the row direction corresponds to an integer multiple of the number of cells that the processor can read at once, and
each of the plurality of regions has data that overlaps the columns of another region by a number of columns equal to the size of the filter minus 1.