JP2021060748A

JP2021060748A - Arithmetic processing device and arithmetic processing method

Info

Publication number: JP2021060748A
Application number: JP2019184072A
Authority: JP
Inventors: Yoshiki Kurokawa (黒川 能毅); Yuichiro Aoki (青木 雄一郎); Tsuyoshi Tanaka (田中 剛)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2019-10-04
Filing date: 2019-10-04
Publication date: 2021-04-15

Abstract

To accelerate matrix arithmetic processing. SOLUTION: An arithmetic processing device can access a storage device that stores matrix data, and includes: a deciding unit that increases a partial column size, which is the number of columns in the row direction of a submatrix serving as a division unit of the matrix data, obtains, when the increased partial column size becomes equal to or greater than the number of a plurality of multipliers, a partial row size, which is the number of rows in the column direction of the submatrix, on the basis of the partial column size before the increase and the number of the plurality of multipliers, and decides the shape of the submatrix formed by the partial column size and the partial row size; the plurality of multipliers; a plurality of adders; and a matrix arithmetic processing unit that executes a product-sum operation on a first submatrix, having the shape decided by the deciding unit, of first matrix data stored in the storage device and a second submatrix, having the shape decided by the deciding unit, of second matrix data stored in the storage device. SELECTED DRAWING: Figure 2

Description

The present invention relates to an arithmetic unit and an arithmetic method for performing matrix operations.

In various applications, matrix operations are executed on a computer, but their computational cost is large and becomes a bottleneck in the overall processing. Speed-ups have therefore long been pursued by various methods. For example, the information processing apparatus of Patent Document 1 divides a matrix into four, from the beginning in the row direction, in units of multiples of the block size, and specifies four first submatrices. The information processing apparatus divides the area of the matrix other than the four first submatrices into four and specifies four second submatrices. It then assigns to respective threads the matrix operation that generates the value of each element of each first submatrix and the matrix operation that generates the value of each element of each second submatrix.

Patent Document 1: JP-A-2018-197906

An offload device such as an FPGA (Field-Programmable Gate Array) has a certain amount of internal memory and, through high-speed access to that memory, performs calculations at high speed in place of the processor, thereby reducing the load on the processor. When the method of Patent Document 1 is applied to an offload device, the internal memory cannot be fully exploited, because the internal memory of the offload device is larger than the registers of the processor.

Further, when the method is implemented in logic such as an FPGA, it is inefficient because it requires a dedicated circuit for the portion to which no part of the matrix is allocated. The FPGA allocates internal memory two-dimensionally to a part of the matrix and executes the multiplication of that part at high speed. With this method, however, the optimum submatrix shape for maximizing computational efficiency depends on the shape of the original matrix, that is, on its row and column sizes.

When the FPGA multiplies matrices of a fixed size, the multiplication can be handled by tuning and fixing the submatrix size in advance. However, when matrix inputs of various sizes are expected, it is difficult to settle on a single submatrix size.

An object of the present invention is to speed up matrix operations.

An arithmetic unit according to one aspect of the invention disclosed in the present application is an arithmetic unit that can access a storage device storing matrix data, and includes: a determination unit that increases a partial column size, which is the number of columns in the row direction of a submatrix serving as a division unit of the matrix data, obtains, when the increased partial column size becomes equal to or greater than the number of a plurality of multipliers, a partial row size, which is the number of rows in the column direction of the submatrix, based on the partial column size before the increase and the number of the plurality of multipliers, and determines the shape of the submatrix composed of the partial column size and the partial row size; and a matrix operation unit that has the plurality of multipliers and a plurality of adders and executes a product-sum operation on a first submatrix, having the shape determined by the determination unit, of first matrix data stored in the storage device and a second submatrix, having the shape determined by the determination unit, of second matrix data stored in the storage device.

According to a representative embodiment of the present invention, matrix operations can be sped up. Problems, configurations, and effects other than those described above will become clear from the description of the following examples.

FIG. 1 is an explanatory diagram showing an example of determining the shape of a submatrix.
FIG. 2 is a block diagram showing a hardware configuration example of the arithmetic unit.
FIG. 3 is a flowchart showing the higher-level algorithm for determining the submatrix size by the submatrix shape determination circuit.
FIG. 4 is a flowchart showing the submatrix size determination algorithm executed by the submatrix shape determination circuit in steps S303 and S306.
FIG. 5 is an explanatory diagram showing the relationship between the sense amplifier cache size, the column size, and the partial column size.
FIG. 6 is a flowchart showing the multiplication execution algorithm that uses the submatrix determined by the submatrix shape determination circuit.
FIG. 7 is an explanatory diagram showing a processing example downstream of the submatrix shape determination circuit.
FIG. 8 is an explanatory diagram showing an operation example of the matrix operation array.
FIG. 9 is a block diagram showing details of the register group.
FIG. 10 is a flowchart showing an example of control processing by the control program.
FIG. 11 is an explanatory diagram showing an example of control processing of the log output unit.

<Example of determining the submatrix shape>
FIG. 1 is an explanatory diagram showing an example of determining the shape of a submatrix. The input matrix 101 is a matrix having a number of columns (COL) 102 in the row direction and a number of rows (ROW) 103 in the column direction. When an arithmetic unit that performs matrix operations on an offload device acquires an input matrix, it divides the input matrix 101 into submatrices prior to the matrix operation. The submatrix is the division unit of the input matrix 101.

Here, the first submatrix 111 is a matrix having a number of columns (COL) 112 and a number of rows (ROW) 113. As an example, the first submatrix 111 is a square matrix. The second submatrix 121 is a matrix having a number of columns (COL) 122 and a number of rows (ROW) 123. Between the first submatrix 111 and the second submatrix 121, the number of columns (COL) 112 < the number of columns (COL) 122, and the number of rows (ROW) 113 > the number of rows (ROW) 123.

The first division example 110 is an example in which the input matrix 101 is divided by the first submatrix 111. Dividing the input matrix 101 by the first submatrix 111 yields eight first submatrices 111: vertical V (2) × horizontal H (4).

The second division example 120 is an example in which the input matrix 101 is divided by the second submatrix 121. Dividing the input matrix 101 by the second submatrix 121 yields six second submatrices 121: vertical V (3) × horizontal H (2).

The first division example 110 has eight divisions and the second division example 120 has six. By choosing the shape of the second submatrix 121, the arithmetic unit needs fewer operations than with the first submatrix 111, which speeds up the arithmetic processing.
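The division counts above can be reproduced with a ceiling division in each direction. The following is a minimal sketch; the function names and the concrete sizes (a 5 × 14 input, a 4 × 4 tile, and a 2 × 8 tile) are illustrative assumptions chosen to match the counts in FIG. 1, not values taken from the patent.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def count_tiles(rows: int, cols: int, p_rows: int, p_cols: int):
    """Return (V, H, V * H): tiles needed vertically, horizontally, and in total."""
    v = ceil_div(rows, p_rows)  # vertical V: tiles along the column direction
    h = ceil_div(cols, p_cols)  # horizontal H: tiles along the row direction
    return v, h, v * h


# Hypothetical sizes: both tiles hold 16 elements, but the wider tile covers
# the same 5 x 14 matrix with fewer submatrices.
square = count_tiles(5, 14, 4, 4)  # 2 x 4 tiles
wide = count_tiles(5, 14, 2, 8)    # 3 x 2 tiles
```

With these assumed sizes, the square tile needs eight submatrices and the wide tile needs six, mirroring the first and second division examples.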

<Hardware configuration example of the arithmetic unit>
FIG. 2 is a block diagram showing a hardware configuration example of the arithmetic unit. The arithmetic unit 200 executes matrix operations. The arithmetic unit 200 includes a processor 201, a memory 202, a general-purpose bus 203, an FPGA 204, and a DRAM (Dynamic Random Access Memory) 205. The arithmetic unit 200 need only include at least the FPGA 204.

The processor 201 controls the memory 202 and the FPGA 204. For example, the processor 201 sequentially reads the instruction sequence of the control program 206 in the memory 202 and executes the control program 206. When executing the control program 206, the processor 201 reads data from the memory 202 as processing input and writes the execution result to the memory 202. The processor 201 also exchanges data with and controls the FPGA 204 by accessing the FPGA 204 via the general-purpose bus 203.

The memory 202 stores the control program 206, the matrix A data 207, the matrix B data 208, the result C data 209, the log data 210, and various other data (not shown). The control program 206 is a program for controlling the arithmetic unit 200 and is executed by the processor 201. The matrix A data 207 is data representing one input matrix 101, and the matrix B data 208 is data representing the other input matrix 101. When a matrix operation is executed, the matrix A data 207 and the matrix B data 208 are transferred via the general-purpose bus 203 to the DRAM 205 connected to the FPGA 204.

The result C data 209 is data representing the product of the matrix A data 207 and the matrix B data 208. The result C data 209 is stored in the memory 202 from the DRAM 205 via the FPGA 204 and the general-purpose bus 203. The log data 210 is data representing a log of the arithmetic processing of the FPGA 204 and is written to the memory 202 via the general-purpose bus 203. The general-purpose bus 203 connects the processor 201 and the FPGA 204 and carries data between them.

The FPGA 204 is an offload device in which logic circuits are programmably configured and operated. The FPGA 204 executes matrix operations. The FPGA 204 is connected to the general-purpose bus 203 and exchanges data with the processor 201. The FPGA 204 also has a function for accessing the DRAM 205, and stores data in and reads data from the DRAM 205. In addition, the FPGA 204 can read and write data in the memory 202 via the general-purpose bus 203 by using its DMA function.

The FPGA 204 includes an array size register 211, a sense amplifier cache size register 212, a threshold register 213, a matrix AB size register 214, a submatrix shape determination circuit 215, an addition control circuit 216, a read address generation unit 217, a read buffer 218, a zero value 219, a selector 220, a matrix operation array 221, a write address generation unit 222, a write buffer 223, a DMAC (Direct Memory Access Controller) 224, a log output unit 225, and a bus control unit 226.

The array size register 211 holds the size of the matrix operation array 221. The sense amplifier cache size register 212 holds the size of the sense amplifier mounted on the DRAM 205. The threshold register 213 holds, as a threshold, the number of sense amplifier cache hits used when determining the shape of the submatrix in a matrix operation. The matrix AB size register 214 stores the size of the matrix A data and the size of the matrix B data, each separated into rows and columns.

The submatrix shape determination circuit 215 uses the data stored in the array size register 211, the sense amplifier cache size register 212, the threshold register 213, and the matrix AB size register 214 to calculate the submatrix size required to execute the matrix operation. The calculated value is output to the addition control circuit 216 and is used to perform the computation with the optimum submatrix size. The circuit also generates the fractional part of the matrix that does not fit the submatrix (the margin portions of the first and second division examples in FIG. 1) and outputs it to the addition control circuit 216.

The addition control circuit 216 acquires the submatrix shape from the submatrix shape determination circuit 215 and controls the matrix operation array 221. The addition control circuit 216 also controls the selector 220 so that zeros are inserted into the fractional part. The read address generation unit 217 generates read addresses for the DRAM 205 and reads the matrix A data 207 and the matrix B data 208 from the DRAM 205. The matrix A data 207 and the matrix B data 208 that have been read are stored in the read buffer 218.

The read buffer 218 stores the matrix A data 207 and the matrix B data 208 read from the DRAM 205 and outputs them to the matrix operation array 221 via the selector 220. The zero value 219 is data representing a zero in the data type used in the matrix operation. In response to a selection instruction from the addition control circuit 216, the selector 220 selects either the data sent from the read buffer 218 (the matrix A data 207 or the matrix B data 208) or the zero value 219, and outputs the selection to the matrix operation array 221.
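The role of the selector and the zero value can be pictured as reading a fixed-shape tile in which every position outside the source matrix is replaced by zero. This is a minimal software sketch of that idea only; `read_tile` and its parameters are hypothetical names, and the real circuit operates on buffered DRAM data rather than Python lists.

```python
def read_tile(matrix, row0, col0, p_rows, p_cols, zero=0):
    """Return a p_rows x p_cols tile starting at (row0, col0).

    Positions that fall outside the matrix are filled with `zero`,
    mimicking the selector choosing the zero value instead of buffer data.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    return [
        [
            matrix[r][c] if r < n_rows and c < n_cols else zero
            for c in range(col0, col0 + p_cols)
        ]
        for r in range(row0, row0 + p_rows)
    ]


m = [[1, 2, 3],
     [4, 5, 6]]
right_edge = read_tile(m, 0, 2, 2, 2)   # overlaps the right margin
bottom_edge = read_tile(m, 1, 0, 2, 2)  # overlaps the bottom margin
```

Because a zero operand contributes nothing to a product-sum, padded tiles can be fed to a fixed-size multiplier array without changing the result.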

The matrix operation array 221 performs a matrix operation, that is, a product-sum operation, on the data from the selector 220. Under the control of the addition control circuit 216, the matrix operation array 221 executes the product-sum operation in accordance with the selected submatrix size. The execution result is transferred to the write buffer 223.

The write address generation unit 222 generates addresses for writing the result C data 209 stored in the write buffer 223 to the DRAM 205. The write buffer 223 holds the result C data 209 generated by the matrix operation array 221 and writes it to the DRAM 205. The DMAC 224 controls data transmission and reception between the FPGA 204 and the memory 202. The data exchanged includes, for example, the matrix A data 207, the matrix B data 208, the result C data 209, and the log data 210; the DMAC 224 controls the transmission and reception of data held in the DRAM 205 or inside the FPGA 204.

The log output unit 225 stores the operation of the submatrix shape determination circuit 215 internally as a log and outputs it to the log data 210 in the memory 202 by using the DMAC 224. The bus control unit 226 controls the general-purpose bus 203 and controls reads and writes by the processor 201 to the array size register 211, the sense amplifier cache size register 212, the threshold register 213, and the matrix AB size register 214.

The DRAM 205 is a storage device that holds data and exchanges data with the FPGA 204. The DRAM 205 is used exclusively by the FPGA 204.

The arithmetic unit 200 operates as follows. The arithmetic unit 200 causes the processor 201 to execute the control program 206. Through the control program 206, the arithmetic unit 200 sets the array size register 211, the sense amplifier cache size register 212, the threshold register 213, and the matrix AB size register 214. The arithmetic unit 200 then starts the FPGA 204.

The FPGA 204 uses the DMAC 224 to transfer the matrix A data 207 and the matrix B data 208 to the DRAM 205. After the transfer, the submatrix shape determination circuit 215 starts and calculates the submatrix size that is optimum for both the matrix A and the matrix B. Based on this result, the read address generation unit 217 cuts out portions of the submatrix size from the matrix A and the matrix B transferred to the DRAM 205 and transfers them to the read buffer 218.

The selector 220 forwards the transferred data to the matrix operation array 221, and the matrix operation array 221 performs the submatrix operation and writes the result C data 209 to the write buffer 223. The matrix operation array 221 writes the result C data 209 to the DRAM 205 according to the address generated by the write address generation unit 222. When all matrix operations are complete, the FPGA 204 writes the result C data 209 to the memory 202 by means of the DMAC 224. The optimum submatrix size calculated by the submatrix shape determination circuit 215 is also sent to the log output unit 225. The log output unit 225 generates the log data 210, which is transferred to the memory 202 using the DMAC 224.

<Higher-level algorithm for determining the submatrix size by the submatrix shape determination circuit 215>
FIG. 3 is a flowchart showing the higher-level algorithm for determining the submatrix size by the submatrix shape determination circuit 215. The submatrix shape determination circuit 215 sets the column size of the matrix A in the column size register of the matrix AB size register 214 (step S301). The submatrix shape determination circuit 215 sets the row size of the matrix A in the row size register of the matrix AB size register 214 (step S302).

The submatrix shape determination circuit 215 determines the number of columns pCOL and the number of rows pROW of the submatrix of the matrix A, and the horizontal count Ha and vertical count Va of the matrix A (step S303). The number of columns pCOL of the matrix A is the number of columns in the row direction of the submatrix of the matrix A, and the number of rows pROW of the matrix A is the number of rows in the column direction of the submatrix of the matrix A. The horizontal count Ha is the number of submatrices along the horizontal H (row direction) of the matrix A divided into submatrices, and the vertical count Va is the number of submatrices along the vertical V (column direction) of the matrix A divided into submatrices. Details of step S303 are described later with reference to FIG. 4.

The submatrix shape determination circuit 215 sets the row size of the matrix B in the column size register of the matrix AB size register 214 (step S304). The submatrix shape determination circuit 215 sets the column size of the matrix B in the row size register of the matrix AB size register 214 (step S305). Because the matrix B is transposed and DA, steps S304 and S305 differ from steps S301 and S302: the row size of the matrix B is set in the column size register, and the column size of the matrix B is set in the row size register.

The submatrix shape determination circuit 215 determines the number of columns pCOL and the number of rows pROW of the submatrix of the matrix B, and the horizontal count Hb and vertical count Vb of the matrix B (step S306). The number of columns pCOL of the matrix B is the number of columns in the row direction of the submatrix of the matrix B, and the number of rows pROW of the matrix B is the number of rows in the column direction of the submatrix of the matrix B. The horizontal count Hb is the number of submatrices along the horizontal H (row direction) of the matrix B divided into submatrices, and the vertical count Vb is the number of submatrices along the vertical V (column direction) of the matrix B divided into submatrices. Details of step S306 are described later with reference to FIG. 4.

The submatrix shape determination circuit 215 swaps the number of columns pCOL and the number of rows pROW of the submatrix of the matrix B, and swaps the horizontal count Hb and the vertical count Vb (step S307). The submatrix shape determination circuit 215 compares the number of columns pCOL of the submatrix of the matrix A with the number of columns pCOL of the submatrix of the matrix B, and selects the larger as the number of columns pCOL of the submatrix (step S308).

The submatrix shape determination circuit 215 selects the number of rows pROW of the submatrix of whichever matrix (A or B) had its number of columns pCOL selected (step S309). When the matrix A was selected, the submatrix shape determination circuit 215 recalculates the horizontal count Hb and the vertical count Vb using the pCOL selected in step S308 and the pROW selected in step S309; when the matrix B was selected, it recalculates the horizontal count Ha and the vertical count Va using the same pCOL and pROW (step S310). This completes the series of processes.
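Steps S307 to S310 can be sketched as a small reconciliation routine. This is an illustrative reading of the flowchart text, not the circuit itself: the `Tiling` type, the `reconcile` function, the tie-breaking toward the matrix A, and the use of each matrix's untransposed row and column sizes for the recalculation are all assumptions.

```python
from dataclasses import dataclass


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


@dataclass
class Tiling:
    p_col: int  # columns per submatrix (pCOL)
    p_row: int  # rows per submatrix (pROW)
    h: int      # submatrices in the row direction (H)
    v: int      # submatrices in the column direction (V)


def reconcile(a: Tiling, b_t: Tiling, a_rows, a_cols, b_rows, b_cols):
    """Merge the per-matrix results of S303 (a) and S306 (b_t, on transposed B)."""
    # S307: swap pCOL/pROW and H/V to undo the transposition of B.
    b = Tiling(p_col=b_t.p_row, p_row=b_t.p_col, h=b_t.v, v=b_t.h)
    if a.p_col >= b.p_col:  # S308 (tie goes to A: an assumption)
        p_col, p_row = a.p_col, a.p_row  # S309
        # S310: recompute the counts for the matrix that was not selected.
        b = Tiling(p_col, p_row, ceil_div(b_cols, p_col), ceil_div(b_rows, p_row))
    else:
        p_col, p_row = b.p_col, b.p_row  # S309
        a = Tiling(p_col, p_row, ceil_div(a_cols, p_col), ceil_div(a_rows, p_row))
    return a, b


# Hypothetical inputs: A is 5 x 14, B is 14 x 6.
a_out, b_out = reconcile(Tiling(8, 2, 2, 3), Tiling(4, 4, 4, 2), 5, 14, 14, 6)
```

In this example the matrix A has the larger pCOL, so its 2 × 8 shape is kept and only the counts for the matrix B are recomputed.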

<Submatrix size determination algorithm of the submatrix shape determination circuit 215>
FIG. 4 is a flowchart showing the submatrix size determination algorithm executed by the submatrix shape determination circuit 215 in steps S303 and S306. The submatrix shape determination circuit 215 initializes the minimum number of accesses to the DRAM 205 to the largest value that the minimum number of accesses can take (step S401). The submatrix shape determination circuit 215 sets the partial column size to its initial value of "1" (step S402).

The submatrix shape determination circuit 215 shifts the partial column size by 1 bit, doubling its value (step S403). The submatrix shape determination circuit 215 compares the partial column size with the value of the array size register 211 (step S404). If the partial column size is equal to or larger than the array size (step S404: Yes), no larger partial column size can be used, so the process proceeds to the termination processing (steps S415 to S418). Otherwise (step S404: No), the process proceeds to step S405.

The submatrix shape determination circuit 215 calculates the number of sense amplifier cache hits based on the sense amplifier cache size register 212, the column size register of the target matrix, and the current partial column size (step S405). The target matrix is the matrix A when this flowchart is executed as part of step S303, and the matrix B when it is executed as part of step S306.

FIG. 5 is an explanatory diagram showing the relationship between the sense amplifier cache size, the column size, and the partial column size. Specifically, for example, the submatrix shape determination circuit 215 divides the value of the sense amplifier cache size register 212 (the sense amplifier cache size) by the column size of the target matrix and discards the remainder. The submatrix shape determination circuit 215 then multiplies this truncated quotient by the partial column size of the submatrix. To this product it adds the smaller of two values: the remainder of the sense amplifier cache size divided by the column size, and the partial column size. The resulting sum is the number of sense amplifier cache hits.
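The calculation just described reduces to one integer division, one remainder, and a minimum. The sketch below follows the prose directly; the function and parameter names are illustrative, and the numeric example uses made-up register values.

```python
def sense_amp_cache_hits(sa_cache_size: int, col_size: int, part_col_size: int) -> int:
    """Number of sense amplifier cache hits as described for step S405.

    hits = floor(sa_cache_size / col_size) * part_col_size
           + min(sa_cache_size mod col_size, part_col_size)
    """
    full_rows, remainder = divmod(sa_cache_size, col_size)
    return full_rows * part_col_size + min(remainder, part_col_size)


# Hypothetical values: a 1024-element sense amplifier cache, 300-column matrix,
# partial column size 4. Three full matrix rows fit (3 * 4 hits), and the
# remaining 124 elements cover one more partial-column run (4 hits).
hits = sense_amp_cache_hits(1024, 300, 4)
```

Under these assumed values the result is 16 hits, which step S406 would then compare against the threshold register.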

Returning to FIG. 4, the submatrix shape determination circuit 215 compares the number of sense amplifier cache hits calculated in step S405 with the value of the threshold register 213 (step S406). If the number of cache hits is smaller than the threshold (step S406: No), the partial column size is judged ineligible and the process returns to step S403. If the number of cache hits is equal to or greater than the threshold (step S406: Yes), the process proceeds to step S407.

The submatrix shape determination circuit 215 sets the number of columns pCOL of the submatrix of the target matrix to the partial column size (step S407). The submatrix shape determination circuit 215 calculates the number of rows pROW of the submatrix of the target matrix by dividing the value of the array size register 211 by the number of columns pCOL set in step S407 (step S408).

部分行列形状決定回路２１５は、横方向の部分行列数である横Ｈを求める（ステップＳ４０９）。具体的には、たとえば、部分行列形状決定回路２１５は、対象行列の列サイズレジスタの値を列数ｐＣＯＬで割り算し、切り上げることで、横Ｈを算出する。 The submatrix shape determination circuit 215 obtains the lateral H, which is the number of submatrixes in the lateral direction (step S409). Specifically, for example, the submatrix shape determination circuit 215 calculates the lateral H by dividing the value of the column size register of the target matrix by the number of columns pCOL and rounding up.

部分行列形状決定回路２１５は、縦方向の部分行列数である縦Ｖを求める（ステップＳ４１０）。具体的には、たとえば、部分行列形状決定回路２１５は、対象行列の行サイズレジスタの値を行数ｐＲＯＷで割り算し、切り上げることで、縦Ｖを算出する。 The submatrix shape determination circuit 215 obtains the vertical V, which is the number of submatrices in the vertical direction (step S410). Specifically, for example, the submatrix shape determination circuit 215 calculates the vertical V by dividing the value of the row size register of the target matrix by the number of rows pROW and rounding up.

部分行列形状決定回路２１５は、ステップＳ４０９で算出した横Ｈと、ステップＳ４１０で算出した縦Ｖとを乗算し、対象行列を覆う部分行列数Ｎを求める（ステップＳ４１１）。部分行列形状決定回路２１５は、部分行列数Ｎと最小アクセス回数とを比較する（ステップＳ４１２）。部分行列数Ｎが最小アクセス回数より大きい場合（ステップＳ４１２：Ｎｏ）、不適格として、ステップＳ４０３に戻る。それ以外の場合（ステップＳ４１２：Ｙｅｓ）、現在の部分行列数Ｎが適格として、ステップＳ４１３に移行する。 The submatrix shape determination circuit 215 multiplies the horizontal H calculated in step S409 by the vertical V calculated in step S410 to obtain the number N of submatrix covering the target matrix (step S411). The submatrix shape determination circuit 215 compares the number of submatrix N with the minimum number of accesses (step S412). If the number of submatrix N is larger than the minimum number of accesses (step S412: No), the process returns to step S403 as ineligible. In other cases (step S412: Yes), the current number of submatrix N is qualified and the process proceeds to step S413.

部分行列形状決定回路２１５は、変数である選択列数ｐＣＯＬに現在の列数ｐＣＯＬの値を代入して仮決定値とする（ステップＳ４１３）。部分行列形状決定回路２１５は、最小アクセス回数に現在の部分行列数Ｎを代入し（ステップＳ４１４）、ステップＳ４０３に戻る。このループにより、図１に示したように、部分行列として当初、第１部分行列１１１が選択されるが、最小アクセス数（Ｎ＝Ｖ×Ｈ）がより小さくなる第２部分行列１２１に更新されることになる。 The submatrix shape determination circuit 215 assigns the current number of columns pCOL to the variable holding the selected number of columns pCOL, as a provisional value (step S413). The submatrix shape determination circuit 215 assigns the current number of submatrices N to the minimum number of accesses (step S414) and returns to step S403. Through this loop, as shown in FIG. 1, the first submatrix 111 is selected initially, and the selection is then updated to the second submatrix 121, which yields a smaller minimum number of accesses (N = V × H).

上述したように、ステップＳ４０４において、部分列サイズがアレイサイズ以上になると（ステップＳ４０４：Ｙｅｓ）、それ以上の部分列サイズを使用できないため、終了処理（ステップＳ４１５〜Ｓ４１８）に移行する。終了処理移行時における現在の選択列数ｐＣＯＬが今回の行列サイズに最適な列数ｐＣＯＬとなる。 As described above, when the partial column size becomes equal to or larger than the array size in step S404 (step S404: Yes), no larger partial column size can be used, so the process proceeds to the end processing (steps S415 to S418). The selected number of columns pCOL at the time of this transition is the optimum number of columns pCOL for the current matrix size.

部分行列形状決定回路２１５は、決定した列数ｐＣＯＬとして、現在の選択列数ｐＣＯＬを設定する（ステップＳ４１５）。部分行列形状決定回路２１５は、アレイサイズレジスタ２１１の値を、ステップＳ４１５の列数ｐＣＯＬで割り算し、行数ｐＲＯＷの値を算出する（ステップＳ４１６）。 The submatrix shape determination circuit 215 sets the current number of selected columns pCOL as the determined number of columns pCOL (step S415). The submatrix shape determination circuit 215 divides the value of the array size register 211 by the number of columns pCOL in step S415 to calculate the value of the number of rows pROW (step S416).

部分行列形状決定回路２１５は、列サイズレジスタの値を、ステップＳ４１５の列数ｐＣＯＬで割り算し、切り上げることで、対象行列の横方向の部分行列数である横Ｈを算出する（ステップＳ４１７）。部分行列形状決定回路２１５は、行サイズレジスタの値をステップＳ４１６の行数ｐＲＯＷで割り算し、切り上げることで、対象行列の縦方向の部分行列数であるＶを算出する（ステップＳ４１８）。これにより、一連の処理が終了する。 The submatrix shape determination circuit 215 divides the value of the column size register by the number of columns pCOL in step S415 and rounds up to calculate the lateral H, which is the number of submatrixes in the horizontal direction of the target matrix (step S417). The submatrix shape determination circuit 215 divides the value of the row size register by the number of rows pROW in step S416 and rounds up to calculate V, which is the number of submatrixes in the vertical direction of the target matrix (step S418). As a result, a series of processes is completed.
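The search loop of steps S405 to S418 can be illustrated with the following minimal Python sketch. It is not the circuit itself, and several points are assumptions labeled as such: the function and parameter names are hypothetical, the list of candidate partial column sizes (produced by steps S403 and S404, which lie outside this passage) is passed in by the caller and assumed to consist of divisors of the array size, and the minimum number of accesses is assumed to start at infinity.

```python
import math


def ceil_div(a, b):
    """Integer division rounded up, as in steps S409/S410 and S417/S418."""
    return -(-a // b)


def choose_shape(array_size, cache_size, threshold, col_size, row_size, candidates):
    """Return (pCOL, pROW, H, V), or None if no candidate is eligible."""
    selected_pcol = None        # "selected number of columns", set in S413
    min_accesses = math.inf     # assumed initial value (not stated in this passage)

    for pcol in candidates:     # candidate generation (S403/S404) is assumed
        # S405: sense amplifier cache hit count for this candidate
        hits = (cache_size // col_size) * pcol + min(cache_size % col_size, pcol)
        if hits < threshold:    # S406: No -> candidate is ineligible
            continue
        prow = array_size // pcol           # S407, S408
        h = ceil_div(col_size, pcol)        # S409
        v = ceil_div(row_size, prow)        # S410
        n = h * v                           # S411
        if n > min_accesses:    # S412: No -> candidate is ineligible
            continue
        selected_pcol = pcol    # S413
        min_accesses = n        # S414

    if selected_pcol is None:
        return None
    prow = array_size // selected_pcol              # S415, S416
    return (selected_pcol, prow,
            ceil_div(col_size, selected_pcol),      # S417
            ceil_div(row_size, prow))               # S418
```

Because step S412 rejects a candidate only when N is strictly larger than the current minimum, a later candidate that ties the minimum replaces the earlier one in this sketch.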

＜部分行列形状決定回路２１５による部分行列を使用した乗算実行アルゴリズム＞ <Multiplication execution algorithm using submatrices by the submatrix shape determination circuit 215>
図６は、部分行列形状決定回路２１５による部分行列を使用した乗算実行アルゴリズムを示すフローチャートである。部分行列形状決定回路２１５は、０埋めを行う領域を指定する数値を算出する（ステップＳ６０１）。Ｃａは行列Ａの列方向の０埋め範囲の指定値であり、行列ＡＢサイズレジスタ２１４内のＡ列サイズレジスタの値をステップＳ３０３で決定された行列Ａの列数ｐＣＯＬで割り算した余りが設定される。０埋め範囲の指定値とは、そのときの部分行列がどこまで０埋めとなるかを指定する値である。 FIG. 6 is a flowchart showing the multiplication execution algorithm using submatrices by the submatrix shape determination circuit 215. The submatrix shape determination circuit 215 calculates the numerical values that specify the regions to be zero-filled (step S601). Ca is the specified value of the zero-fill range in the column direction of matrix A; it is set to the remainder obtained by dividing the value of the A column size register in the matrix AB size register 214 by the number of columns pCOL of matrix A determined in step S303. The specified value of a zero-fill range is a value that specifies how far the submatrix at that time is zero-filled.

Ｒａは行列Ａの行方向の０埋め範囲の指定値であり、行列ＡＢサイズレジスタ２１４内のＡ行サイズレジスタの値をステップＳ３０３で決定された行列Ａの行数ｐＲＯＷで割り算した余りが設定される。Ｃｂは行列Ｂの列方向の０埋め範囲の指定値であり、行列ＡＢサイズレジスタ２１４内のＢ列サイズレジスタの値をステップＳ３０７での入れ替え後の行列Ｂの列数ｐＣＯＬで割り算した余りが設定される。Ｒｂは行列Ｂの行方向の０埋め範囲の指定値であり、行列ＡＢサイズレジスタ２１４内のＢ行サイズレジスタの値をステップＳ３０７での入れ替え後の行列Ｂの行数ｐＲＯＷで割り算した余りが設定される。 Ra is the specified value of the zero-fill range in the row direction of matrix A; it is set to the remainder obtained by dividing the value of the A row size register in the matrix AB size register 214 by the number of rows pROW of matrix A determined in step S303. Cb is the specified value of the zero-fill range in the column direction of matrix B; it is set to the remainder obtained by dividing the value of the B column size register in the matrix AB size register 214 by the number of columns pCOL of matrix B after the swap in step S307. Rb is the specified value of the zero-fill range in the row direction of matrix B; it is set to the remainder obtained by dividing the value of the B row size register in the matrix AB size register 214 by the number of rows pROW of matrix B after the swap in step S307.
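The four zero-fill values of step S601 are plain remainders, which the following short Python sketch makes explicit. The function name and argument names are hypothetical; the submatrix dimensions for matrix A and for matrix B (after the swap of step S307) are simply passed in as separate arguments, since how they are derived is described elsewhere.

```python
def zero_fill_values(a_cols, a_rows, b_cols, b_rows,
                     a_pcol, a_prow, b_pcol, b_prow):
    """Compute Ca, Ra, Cb and Rb as the remainders described for step S601."""
    return {
        "Ca": a_cols % a_pcol,  # A column size mod pCOL of matrix A
        "Ra": a_rows % a_prow,  # A row size mod pROW of matrix A
        "Cb": b_cols % b_pcol,  # B column size mod pCOL of matrix B (after swap)
        "Rb": b_rows % b_prow,  # B row size mod pROW of matrix B (after swap)
    }
```

For example, a matrix A with 10 columns and 3 rows divided into 4 × 4 submatrices gives Ca = 2 and Ra = 3 under this reading.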

部分行列形状決定回路２１５は、ループ変数Ａｒ，Ｂｃ，Ａｃをすべて０に初期設定する（ステップＳ６０２）。部分行列形状決定回路２１５は、図３および図４の処理で決定した部分行列に該当するデータを行列Ａおよび行列Ｂからそれぞれ読み込み、読み込みバッファ２１８に転送する（ステップＳ６０３）。部分行列形状決定回路２１５は、ループ変数ＡｃとＨａ−１とを比較し、同一であれば行列Ａの端と判定し（ステップＳ６０４：Ｙｅｓ）、ステップＳ６０５に移行し、それ以外であれば（ステップＳ６０４：Ｎｏ）、ステップＳ６０６に移行する。 The submatrix shape determination circuit 215 initializes the loop variables Ar, Bc, and Ac to 0 (step S602). It reads the data corresponding to the submatrices determined in the processes of FIGS. 3 and 4 from matrix A and matrix B, respectively, and transfers the data to the read buffer 218 (step S603). It then compares the loop variable Ac with Ha - 1. If they are equal, it determines that the edge of matrix A has been reached (step S604: Yes) and proceeds to step S605; otherwise (step S604: No), it proceeds to step S606.

部分行列形状決定回路２１５は、Ｃａ有効フラグをＯＮに設定し、Ｒｂ有効フラグをＯＮに設定し（ステップＳ６０５）、ステップＳ６０６に移行する。これにより、行列Ａの列の０埋めパラメータにＣａが設定され、行列Ｂの行の０埋めパラメータにＲｂが設定されることになる。 The submatrix shape determination circuit 215 sets the Ca valid flag to ON, sets the Rb valid flag to ON (step S605), and proceeds to step S606. As a result, Ca is set in the zero-filled parameter of the column of the matrix A, and Rb is set in the zero-filled parameter of the row of the matrix B.

部分行列形状決定回路２１５は、ループ変数ＢｃとＨｂ−１とを比較し、同一であれば行列Ｂの端と判定し（ステップＳ６０６：Ｙｅｓ）、ステップＳ６０７に移行し、それ以外であれば（ステップＳ６０６：Ｎｏ）、ステップＳ６０８に移行する。部分行列形状決定回路２１５は、Ｃｂ有効フラグをＯＮに設定し（ステップＳ６０７）、ステップＳ６０８に移行する。これにより、行列Ｂの列の０埋めパラメータにＣｂが設定されることになる。 The submatrix shape determination circuit 215 compares the loop variable Bc with Hb - 1. If they are equal, it determines that the edge of matrix B has been reached (step S606: Yes) and proceeds to step S607; otherwise (step S606: No), it proceeds to step S608. In step S607, the submatrix shape determination circuit 215 sets the Cb valid flag to ON and proceeds to step S608. As a result, Cb is set as the zero-fill parameter for the columns of matrix B.

部分行列形状決定回路２１５は、ループ変数ＡｒとＶａ−１とを比較し、同一であれば行列Ａの端と判定し（ステップＳ６０８：Ｙｅｓ）、ステップＳ６０９に移行し、それ以外であれば（ステップＳ６０８：Ｎｏ）、ステップＳ６１０に移行する。部分行列形状決定回路２１５は、Ｒａ有効フラグをＯＮに設定し（ステップＳ６０９）、ステップＳ６１０に移行する。これにより、行列Ａの行の０埋めパラメータにＲａが設定されることになる。 The submatrix shape determination circuit 215 compares the loop variable Ar with Va - 1. If they are equal, it determines that the edge of matrix A has been reached (step S608: Yes) and proceeds to step S609; otherwise (step S608: No), it proceeds to step S610. In step S609, the submatrix shape determination circuit 215 sets the Ra valid flag to ON and proceeds to step S610. As a result, Ra is set as the zero-fill parameter for the rows of matrix A.

部分行列形状決定回路２１５は、部分行列の乗算処理開始の処理を行う（ステップＳ６１０）。これにより、行列演算アレイ２２１が動作し、部分行列分の乗算が行われる。部分行列形状決定回路２１５は、ループ変数Ａｃの値をインクリメントする（ステップＳ６１１）。部分行列形状決定回路２１５は、ループ変数Ａｃと行列Ａの縦Ｖａとを比較し（ステップＳ６１２）、同一であれば（ステップＳ６１２：Ｙｅｓ）、ステップＳ６１３に移行し、それ以外は（ステップＳ６１２：Ｎｏ）、ステップＳ６１４に移行する。ステップＳ６１３では、部分行列形状決定回路２１５は、ループ変数Ｂｃの値をインクリメントし、ループ変数Ａｃに０を設定して（ステップＳ６１３）、ステップＳ６１４に移行する。 The submatrix shape determination circuit 215 starts the submatrix multiplication (step S610). The matrix operation array 221 thereby operates and performs the multiplication for one submatrix. The submatrix shape determination circuit 215 increments the loop variable Ac (step S611). It compares the loop variable Ac with the vertical Va of matrix A (step S612); if they are equal (step S612: Yes), the process proceeds to step S613; otherwise (step S612: No), it proceeds to step S614. In step S613, the submatrix shape determination circuit 215 increments the loop variable Bc, sets the loop variable Ac to 0, and proceeds to step S614.

部分行列形状決定回路２１５は、ループ変数Ｂｃと行列Ｂの横Ｈｂとを比較し（ステップＳ６１４）、同一であれば（ステップＳ６１４：Ｙｅｓ）、ステップＳ６１５に移行し、それ以外は（ステップＳ６１４：Ｎｏ）、ステップＳ６１６に移行する。ステップＳ６１５では、部分行列形状決定回路２１５は、ループ変数Ａｒをインクリメントし、ループ変数Ｂｃに０を設定して（ステップＳ６１５）、ステップＳ６１６に移行する。 The submatrix shape determination circuit 215 compares the loop variable Bc with the lateral Hb of the matrix B (step S614), and if they are the same (step S614: Yes), the process proceeds to step S615, and otherwise (step S614: No), the process proceeds to step S616. In step S615, the submatrix shape determination circuit 215 increments the loop variable Ar, sets the loop variable Bc to 0 (step S615), and proceeds to step S616.

部分行列形状決定回路２１５は、ループ変数Ａｒと行列Ａの縦Ｖａとを比較し（ステップＳ６１６）、同一であれば（ステップＳ６１６：Ｙｅｓ）、全ての処理が終了する。それ以外であれば（ステップＳ６１６：Ｎｏ）、ステップＳ６０３に戻り、処理が継続する。 The submatrix shape determination circuit 215 compares the loop variable Ar with the vertical Va of the matrix A (step S616), and if they are the same (step S616: Yes), all the processes are completed. Otherwise (step S616: No), the process returns to step S603 and the process continues.
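The control flow of steps S602 to S616 can be traced with the following Python sketch, which yields one entry per submatrix multiplication. It follows the step descriptions literally, including the comparison of Ac against Va in step S612 exactly as the text states it; whether a different bound was intended there is not something this passage settles. The function name is hypothetical, and the sketch assumes that the four valid flags are cleared at the start of every iteration, which the text does not state.

```python
def tile_schedule(ha, va, hb):
    """Yield (Ar, Bc, Ac, flags) for each submatrix multiplication (S610)."""
    ar = bc = ac = 0                                    # S602
    while True:
        # S603 would load the submatrices here; flags are assumed to reset.
        flags = {"Ca": False, "Ra": False, "Cb": False, "Rb": False}
        if ac == ha - 1:                                # S604
            flags["Ca"] = flags["Rb"] = True            # S605
        if bc == hb - 1:                                # S606
            flags["Cb"] = True                          # S607
        if ar == va - 1:                                # S608
            flags["Ra"] = True                          # S609
        yield ar, bc, ac, flags                         # S610: start multiplication
        ac += 1                                         # S611
        if ac == va:                                    # S612 (bound as written)
            bc += 1                                     # S613
            ac = 0
        if bc == hb:                                    # S614
            ar += 1                                     # S615
            bc = 0
        if ar == va:                                    # S616: all done
            return
```

With Ha = Va = 2 and Hb = 3, for example, the generator produces 12 entries, and only the final entry has all four flags set.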

＜部分行列形状決定回路２１５以降の処理＞ <Processing after the submatrix shape determination circuit 215>
図７は、部分行列形状決定回路２１５以降の処理例を示す説明図である。加算制御回路２１６は、部分行列形状決定回路２１５から、Ｃａ、Ｒａ、ＣｂおよびＲｂの入力を受け付ける。また、加算制御回路２１６は、部分行列形状決定回路２１５から、Ｃａ有効フラグ７０１、Ｒａ有効フラグ７０２、Ｃｂ有効フラグ７０３およびＲｂ有効フラグ７０４の入力を受け付ける。これら有効フラグは、そのときの部分行列に０埋めがあるか無いかを示すフラグである。また、加算制御回路２１６は、部分行列形状決定回路２１５から、ステップＳ３０８で選択された部分行列の列数ｐＣＯＬと、ステップＳ３０９で選択された部分行列の行数ｐＲＯＷの入力を受け付ける。 FIG. 7 is an explanatory diagram showing a processing example after the submatrix shape determination circuit 215. The addition control circuit 216 receives Ca, Ra, Cb and Rb from the submatrix shape determination circuit 215. The addition control circuit 216 also receives the Ca valid flag 701, the Ra valid flag 702, the Cb valid flag 703, and the Rb valid flag 704 from the submatrix shape determination circuit 215. These valid flags indicate whether or not the submatrix at that time contains zero padding. In addition, the addition control circuit 216 receives from the submatrix shape determination circuit 215 the number of columns pCOL of the submatrix selected in step S308 and the number of rows pROW of the submatrix selected in step S309.

読み込みバッファ２１８は、ＤＲＡＭ２０５から行列Ａデータ２０７の部分行列であるＡ００，Ａ０１，…を読み込み、行列Ｂデータ２０８の部分行列であるＢ００，Ｂ０１，…を保持する。部分行列であるＡ００，Ａ０１，…は、部分行列のサイズの数だけ存在する。同様に、部分行列であるＢ００，Ｂ０１，…は、部分行列のサイズの数だけ存在する。読み込みバッファ２１８は、すべての部分行列の各々に対して読み出しポートを持ち、すべての部分行列を一度に読み出せるよう構成されている。 The read buffer 218 reads A00, A01, ..., which are submatrices of the matrix A data 207, from the DRAM 205, and holds B00, B01, ..., which are submatrices of the matrix B data 208. The number of submatrices A00, A01, ... equals the size of the submatrix. Likewise, the number of submatrices B00, B01, ... equals the size of the submatrix. The read buffer 218 has a read port for each submatrix and is configured so that all submatrices can be read out at once.

セレクタアレイ７０５は、複数のセレクタ２２０を有する。各セレクタ２２０は、読み込みバッファ２１８からの部分行列と０値とのいずれかを選択する。セレクタ２２０は、読み込みバッファ２１８内の部分行列の個数分存在する。たとえば、各セレクタ２２０は、加算制御回路２１６から０埋めの指示があれば０値を選択し、０埋めの指示がなければ読み込みバッファ２１８からの部分行列を選択する。各セレクタ２２０は、選択したデータを行列演算アレイ２２１の乗算器７０６に出力する。 The selector array 705 has a plurality of selectors 220. Each selector 220 selects either a submatrix from the read buffer 218 or a zero value. There are as many selectors 220 as there are submatrixes in the read buffer 218. For example, each selector 220 selects a 0 value if there is an instruction to fill in 0 from the addition control circuit 216, and selects a submatrix from the read buffer 218 if there is no instruction to fill in 0. Each selector 220 outputs the selected data to the multiplier 706 of the matrix operation array 221.

行列演算アレイ２２１は、複数の乗算器７０６と複数の加算器７０７とにより構成される。乗算器７０６は、行列演算アレイ２２１の中で乗算を実行する。行列Ａデータ２０７の部分行列１個および行列Ｂデータ２０８の部分行列１個の組み合わせに対して乗算器７０６は１個存在し、乗算結果を出力する。加算器７０７は、乗算器７０６の乗算結果を集計する。複数の加算器７０７はツリー状に配置される。詳細は図８で説明する。 The matrix operation array 221 is composed of a plurality of multipliers 706 and a plurality of adders 707. The multiplier 706 performs multiplication in the matrix operation array 221. One multiplier 706 exists for a combination of one submatrix of the matrix A data 207 and one submatrix of the matrix B data 208, and outputs the multiplication result. The adder 707 aggregates the multiplication results of the multiplier 706. The plurality of adders 707 are arranged in a tree shape. Details will be described with reference to FIG.

書き出しバッファ２２３は、行列演算アレイ２２１からの結果Ｃデータ２０９の部分行列Ｃ００，Ｃ０１，Ｃ０２，Ｃ０３，…を保持する。セレクタ７０８は、書き出しバッファ２２３内に保存されている結果Ｃデータ２０９の部分行列を選択してＤＲＡＭ２０５に書き出す。 The write buffer 223 holds submatrixes C00, C01, C02, C03, ... Of the result C data 209 from the matrix operation array 221. The selector 708 selects a submatrix of the result C data 209 stored in the write buffer 223 and writes it to the DRAM 205.

部分行列形状決定回路２１５以降の動作は、以下の通りである。まず、加算制御回路２１６は、Ｃａ、Ｒａ、ＣｂおよびＲｂと、Ｃａ有効フラグ７０１、Ｒａ有効フラグ７０２、Ｃｂ有効フラグ７０３およびＲｂ有効フラグ７０４と、列数ｐＣＯＬおよび行数ｐＲＯＷとを、部分行列形状決定回路２１５から取得する。加算制御回路２１６は、取得したデータに基づいて、セレクタアレイ７０５の各セレクタ２２０のセレクト値を設定する。 The operation after the submatrix shape determination circuit 215 is as follows. First, the addition control circuit 216 submatrixes Ca, Ra, Cb and Rb, Ca valid flag 701, Ra valid flag 702, Cb valid flag 703 and Rb valid flag 704, and the number of columns pCOL and the number of rows pROW. Obtained from the shape determination circuit 215. The addition control circuit 216 sets the select value of each selector 220 of the selector array 705 based on the acquired data.

つぎに、セレクタアレイ７０５は、読み込みバッファ２１８から行列Ａの部分行列Ａ００，Ａ０１，…および行列Ｂの部分行列Ｂ００，Ｂ０１，…をすべて読み出す。セレクタアレイ７０５は、セレクト値に応じて０値または部分行列のいずれかを選択し、選択結果を行列演算アレイ２２１に出力する。この時点でデータの０埋めが完了する。 Next, the selector array 705 reads all the sub-matrixes A00, A01, ... Of the matrix A and the sub-matrix B00, B01, ... Of the matrix B from the read buffer 218. The selector array 705 selects either a 0 value or a submatrix according to the select value, and outputs the selection result to the matrix operation array 221. At this point, zero padding of the data is completed.

つぎに、行列演算アレイ２２１は、行列Ａの部分行列と行列Ｂの部分行列とを乗算器７０６で乗算し、加算器７０７で加算する。行列演算アレイ２２１は、加算器７０７からの出力である結果Ｃデータ２０９の部分行列を書き出しバッファ２２３に書き出す。そして、セレクタ７０８が選択してＤＲＡＭ２０５へ書き出す。 Next, the matrix operation array 221 multiplies the submatrix of the matrix A and the submatrix of the matrix B by the multiplier 706 and adds them by the adder 707. The matrix operation array 221 writes the submatrix of the result C data 209, which is the output from the adder 707, to the write buffer 223. Then, the selector 708 selects and writes to the DRAM 205.

＜行列演算アレイ２２１の動作例＞ <Operation example of the matrix operation array 221>
図８は、行列演算アレイ２２１の動作例を示す説明図である。図８では、説明の単純化のため、一例として、アレイサイズが１６（乗算器７０６が１６個）の場合について説明するが、２のべき乗であれば、アレイサイズは１６に限定されない。行列演算アレイ２２１は、アレイサイズである１６組の行列Ａと行列Ｂの要素が並び、これらのペアに対して乗算器７０６で乗算結果を得る。この後、隣接する乗算結果同士を加算する加算器７０７のツリーが構成される。行列演算アレイ２２１は、ツリー状の加算器７０７の最上位にセレクタ８０１Ａ〜８０１Ｄを有する。 FIG. 8 is an explanatory diagram showing an operation example of the matrix operation array 221. To simplify the description, FIG. 8 illustrates the case where the array size is 16 (sixteen multipliers 706), but the array size is not limited to 16 and may be any power of 2. In the matrix operation array 221, 16 pairs of elements of matrix A and matrix B, matching the array size, are lined up, and the multipliers 706 produce a product for each pair. These are followed by a tree of adders 707 that add adjacent products together. The matrix operation array 221 has selectors 801A to 801D at the top of the tree of adders 707.

セレクタ８０１Ａは、２個の乗算器７０６からの乗算結果を加算した加算結果と、４個の乗算器７０６からの乗算結果を加算した加算結果と、８個の乗算器７０６からの乗算結果を加算した加算結果と、１６個の乗算器７０６からの乗算結果を加算した加算結果と、を選択的にＤＲＡＭ２０５に書き出す。 Selector 801A adds the addition result of adding the multiplication results from the two multipliers 706, the addition result of adding the multiplication results from the four multipliers 706, and the multiplication result from the eight multipliers 706. The added result and the added result obtained by adding the multiplication results from the 16 multipliers 706 are selectively written to the DRAM 205.

セレクタ８０１Ｂは、２個の乗算器７０６からの乗算結果を加算した加算結果と、４個の乗算器７０６からの乗算結果を加算した加算結果と、８個の乗算器７０６からの乗算結果を加算した加算結果と、を選択的にＤＲＡＭ２０５に書き出す。 Selector 801B adds the addition result of adding the multiplication results from the two multipliers 706, the addition result of adding the multiplication results from the four multipliers 706, and the multiplication result from the eight multipliers 706. The added result and the result of the addition are selectively written to the DRAM 205.

セレクタ８０１Ｃ，８０１Ｄは、２個の乗算器７０６からの乗算結果を加算した加算結果と、４個の乗算器７０６からの乗算結果を加算した加算結果と、を選択的にＤＲＡＭ２０５に書き出す。セレクタ８０１Ａ〜８０１Ｄで選択されない２個の乗算器７０６からの乗算結果を加算した加算結果は、その加算器７０７からＤＲＡＭ２０５に書き出される。 The selectors 801C and 801D selectively write the addition result obtained by adding the multiplication results from the two multipliers 706 and the addition result obtained by adding the multiplication results from the four multipliers 706 to the DRAM 205. The addition result obtained by adding the multiplication results from the two multipliers 706 not selected by the selectors 801A to 801D is written from the adder 707 to the DRAM 205.

符号８１６は、ステップＳ３０８で選択された部分行列の列数ｐＣＯＬが１６の場合に選択されるデータ範囲であり、符号８０８は、ステップＳ３０８で選択された部分行列の列数ｐＣＯＬが８の場合に選択されるデータ範囲であり、符号８０４は、ステップＳ３０８で選択された部分行列の列数ｐＣＯＬが４の場合に選択されるデータ範囲であり、符号８０２は、ステップＳ３０８で選択された部分行列の列数ｐＣＯＬが２の場合に選択されるデータ範囲である。 Reference numeral 816 is a data range selected when the number of columns pCOL of the submatrix selected in step S308 is 16, and reference numeral 808 is a case where the number of columns pCOL of the submatrix selected in step S308 is 8. Reference numeral 804 is a data range to be selected, reference numeral 804 is a data range selected when the number of columns pCOL of the submatrix selected in step S308 is 4, and reference numeral 802 is a submatrix selected in step S308. This is the data range selected when the number of columns pCOL is 2.

すなわち、セレクタ８０１Ａ〜８０１Ｄは、列数ｐＣＯＬの値により、加算結果を選択する。このように構成することで、列数ｐＣＯＬが１６の場合には、行列演算アレイ２２１は、セレクタ８０１Ａにより、４段目の加算器７０７からの加算結果を出力するよう動作する。また、列数ｐＣＯＬが８の場合には、行列演算アレイ２２１は、セレクタ８０１Ａ，８０１Ｂにより、３段目の２個の加算器７０７からの２個の加算結果を出力するよう動作する。 That is, the selectors 801A to 801D select the addition result according to the value of the number of columns pCOL. With this configuration, when the number of columns pCOL is 16, the matrix operation array 221 operates so as to output the addition result from the fourth-stage adder 707 by the selector 801A. When the number of columns pCOL is 8, the matrix operation array 221 operates so as to output two addition results from the two adders 707 in the third stage by the selectors 801A and 801B.

また、列数ｐＣＯＬが４の場合には、行列演算アレイ２２１は、セレクタ８０１Ａ〜８０１Ｄにより、２段目の４個の加算器７０７からの４個の加算結果を出力するよう動作する。また、列数ｐＣＯＬが２の場合には、行列演算アレイ２２１は、セレクタ８０１Ａ〜８０１Ｄにより、１段目の８個の加算器７０７からの８個の加算結果を出力するよう動作する。 When the number of columns pCOL is 4, the matrix operation array 221 operates so as to output the four addition results from the four adders 707 in the second stage by the selectors 801A to 801D. When the number of columns pCOL is 2, the matrix operation array 221 operates so as to output eight addition results from the eight adders 707 in the first stage by the selectors 801A to 801D.
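The effect of selecting a tree level by pCOL can be modeled with the short Python sketch below. It is a behavioral illustration only, with a hypothetical function name: it takes the products from the multipliers, applies pairwise addition stage by stage, and stops at the stage whose sums each cover pCOL adjacent products, which corresponds to the outputs chosen by the selectors 801A to 801D.

```python
def adder_tree_outputs(products, pcol):
    """Sum adjacent products in groups of pcol by repeated pairwise addition."""
    if pcol < 2 or pcol & (pcol - 1) or len(products) % pcol:
        raise ValueError("pcol must be a power of two >= 2 that divides the array size")
    level = list(products)
    width = 1                    # number of products covered by each value
    while width < pcol:
        # One adder stage: add each pair of neighbouring values.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        width *= 2
    return level
```

With 16 products, pCOL = 16 stops after four stages and returns one sum, pCOL = 8 stops after three stages and returns two sums, pCOL = 4 returns four sums, and pCOL = 2 returns eight sums, matching the stage counts given above.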

＜レジスタ群の詳細＞ <Details of the register group>
図９は、レジスタ群の詳細を示すブロック図である。レジスタ群とは、上述したアレイサイズレジスタ２１１、センスアンプキャッシュサイズレジスタ２１２、スレッショルドレジスタ２１３、行列ＡＢサイズレジスタ２１４および行列演算開始レジスタ９０７である。 FIG. 9 is a block diagram showing details of the register group. The register group comprises the array size register 211, the sense amplifier cache size register 212, the threshold register 213, the matrix AB size register 214, and the matrix operation start register 907 described above.

アドレスデコーダ９０１は、汎用バス２０３を経由してプロセッサ２０１からのアドレスを取得してライトイネーブル信号ｗｅｎを各レジスタに出力する。データラッチ９０２は、汎用バス２０３を経由してプロセッサ２０１からのデータを取得して、目的のレジスタに書き込まれるまで、データを保持する。 The address decoder 901 acquires the address from the processor 201 via the general-purpose bus 203 and outputs the write enable signal wen to each register. The data latch 902 acquires the data from the processor 201 via the general-purpose bus 203 and holds it until it is written to the target register.

具体的には、たとえば、アドレスデコーダ９０１およびデータラッチ９０２は、プロセッサ２０１からのアドレスとデータの対に対して、アドレスに対応するレジスタにデータを書き込む。アドレスデコーダ９０１がその対応を取るよう動作し、それぞれのレジスタに対する書込みを識別し、それぞれのレジスタにライトイネーブル信号ｗｅｎを出力する。 Specifically, for example, for each address-data pair from the processor 201, the address decoder 901 and the data latch 902 write the data to the register corresponding to the address. The address decoder 901 establishes this correspondence: it identifies which register each write targets and outputs the write enable signal wen to that register.

行列ＡＢサイズレジスタ２１４は、行列Ａの列サイズを指定する行列Ａ列サイズレジスタ９０３と、行列Ａの行サイズを指定する行列Ａ行サイズレジスタ９０４と、行列Ｂの列サイズを指定する行列Ｂ列サイズレジスタ９０５と、行列Ｂの行サイズを指定する行列Ｂ行サイズレジスタ９０６と、を有する。 The matrix AB size register 214 includes a matrix A column size register 903 that specifies the column size of the matrix A, a matrix A row size register 904 that specifies the row size of the matrix A, and a matrix B column that specifies the column size of the matrix B. It has a size register 905 and a matrix B row size register 906 that specifies the row size of the matrix B.

行列演算開始レジスタ９０７は、データとして開始値が書き込まれるレジスタである。開始値が書き込まれると、一連の行列演算の実行が開始される。このように構成することで、プロセッサ２０１から各レジスタにデータを書き込むことが可能となり、プロセッサ２０１からのＦＰＧＡ２０４の制御が可能となる。 The matrix operation start register 907 is a register in which the start value is written as data. When the start value is written, the execution of a series of matrix operations is started. With this configuration, it is possible to write data from the processor 201 to each register, and it is possible to control the FPGA 204 from the processor 201.
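As a rough behavioral model of the register group, the following Python sketch maps addresses to registers and starts the operation when the start register is written. The class name, the register names, and in particular the address values are hypothetical, since the text gives no address map. The sketch also assumes that any write to the start register counts as writing the start value.

```python
class RegisterFile:
    # Hypothetical address map; the actual addresses are not given in the text.
    ADDRESS_MAP = {
        0x00: "array_size",            # register 211
        0x04: "sense_amp_cache_size",  # register 212
        0x08: "threshold",             # register 213
        0x0C: "a_col_size",            # register 903
        0x10: "a_row_size",            # register 904
        0x14: "b_col_size",            # register 905
        0x18: "b_row_size",            # register 906
        0x1C: "matrix_op_start",       # register 907
    }

    def __init__(self, on_start):
        self.regs = {name: 0 for name in self.ADDRESS_MAP.values()}
        self.on_start = on_start       # callback standing in for the matrix operation

    def write(self, address, data):
        latched = data                     # role of the data latch 902
        name = self.ADDRESS_MAP[address]   # role of the address decoder 901 (wen)
        self.regs[name] = latched
        if name == "matrix_op_start":      # assumed: any write starts the operation
            self.on_start(dict(self.regs))
```

A caller would write the size registers first and the start register last, so that the callback sees a complete configuration.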

＜制御プログラム２０６による制御処理＞ <Control processing by the control program 206>
図１０は、制御プログラム２０６による制御処理例を示すフローチャートである。プロセッサ２０１は、アレイサイズレジスタ２１１にアレイサイズを設定する（ステップＳ１００１）。プロセッサ２０１は、センスアンプキャッシュサイズレジスタ２１２にセンスアンプキャッシュサイズを設定する（ステップＳ１００２）。プロセッサ２０１は、スレッショルドレジスタ２１３に許容されるセンスアンプキャッシュのヒット回数の下限値を設定する（ステップＳ１００３）。 FIG. 10 is a flowchart showing an example of control processing by the control program 206. The processor 201 sets the array size in the array size register 211 (step S1001). The processor 201 sets the sense amplifier cache size in the sense amplifier cache size register 212 (step S1002). The processor 201 sets, in the threshold register 213, the lower limit of the permissible number of sense amplifier cache hits (step S1003).

プロセッサ２０１は、行列演算を使用するメインプログラム（たとえば、ディープラーニング）からの呼び出しを待つ（ステップＳ１００４：Ｎｏ）。呼び出されたときに（ステップＳ１００４：Ｙｅｓ）、ステップＳ１００５に移行する。 Processor 201 waits for a call from a main program (eg, deep learning) that uses matrix operations (step S1004: No). When called (step S1004: Yes), the process proceeds to step S1005.

プロセッサ２０１は、メインプログラムから指定される行列Ａの列サイズを行列Ａ列サイズレジスタ９０３に設定する（ステップＳ１００５）。プロセッサ２０１は、メインプログラムから指定される行列Ａの行サイズを行列Ａ行サイズレジスタ９０４に設定する（ステップＳ１００６）。プロセッサ２０１は、メインプログラムから指定される行列Ｂの列サイズを行列Ｂ列サイズレジスタ９０５に設定する（ステップＳ１００７）。プロセッサ２０１は、メインプログラムから指定される行列Ｂの行サイズを行列Ｂ行サイズレジスタ９０６に設定する（ステップＳ１００８）。 The processor 201 sets the column size of the matrix A specified by the main program in the matrix A column size register 903 (step S1005). The processor 201 sets the row size of the matrix A specified by the main program in the matrix A row size register 904 (step S1006). The processor 201 sets the column size of the matrix B specified by the main program in the matrix B column size register 905 (step S1007). The processor 201 sets the row size of the matrix B specified by the main program in the matrix B row size register 906 (step S1008).

プロセッサ２０１は、行列Ａデータ２０７のポインタをＤＭＡＣ２２４に設定する（ステップＳ１００９）。プロセッサ２０１は、行列Ｂデータ２０８のポインタをＤＭＡＣ２２４に設定する（ステップＳ１０１０）。プロセッサ２０１は、行列演算開始レジスタ９０７に開始値を書き込み、ＦＰＧＡ２０４に行列演算を開始させる（ステップＳ１０１１）。このときまず、ＤＭＡＣ２２４は行列Ａデータ２０７をＤＲＡＭ２０５へ転送する。つぎに、ＤＭＡＣ２２４は、行列Ｂデータ２０８をＤＲＡＭ２０５へ転送する。このとき、行列Ｂデータ２０８は列と行とを転置して転送される。このあと、部分行列形状決定回路２１５が起動し、実行され、列数ｐＣＯＬ等の設定が計算される。その設定を使用して行列演算が実行される。 The processor 201 sets the pointer of the matrix A data 207 to the DMAC224 (step S1009). The processor 201 sets the pointer of the matrix B data 208 to the DMAC224 (step S1010). The processor 201 writes a start value in the matrix operation start register 907 and causes the FPGA 204 to start the matrix operation (step S1011). At this time, first, the DMAC 224 transfers the matrix A data 207 to the DRAM 205. Next, the DMAC224 transfers the matrix B data 208 to the DRAM 205. At this time, the matrix B data 208 is transferred by transposing the columns and rows. After that, the submatrix shape determination circuit 215 is started and executed, and the setting of the number of columns pCOL and the like is calculated. Matrix operations are performed using that setting.

プロセッサ２０１は、ＦＰＧＡ２０４での行列演算が終了するのを待つ（ステップＳ１０１２）。プロセッサ２０１は、結果Ｃデータ２０９のポインタをＤＭＡＣ２２４に設定する（ステップＳ１０１３）。プロセッサ２０１は、ＤＭＡＣ２２４を起動して、ＤＲＡＭ２０５にある結果Ｃデータ２０９をメモリ２０２に転送する（ステップＳ１０１４）。プロセッサ２０１は、ＤＭＡ終了を待ち、終了したら、行列演算終了を返す。これで１回の行列演算は終了し、ステップＳ１００４に戻り次の呼び出し待ちに入る。 Processor 201 waits for the matrix operation in FPGA 204 to finish (step S1012). The processor 201 sets the pointer of the result C data 209 to DMAC224 (step S1013). The processor 201 activates the DMAC 224 and transfers the result C data 209 in the DRAM 205 to the memory 202 (step S1014). The processor 201 waits for the end of DMA, and when it finishes, returns the end of the matrix operation. This completes one matrix operation, returns to step S1004, and waits for the next call.

＜ログ出力部２２５の制御処理＞ <Control processing of the log output unit 225>
図１１は、ログ出力部２２５の制御処理例を示す説明図である。ログデータ２１０は、ログ内タイムスタンプ１１０１と、ログ内行列Ａサイズ１１０２と、ログ内行列Ｂサイズ１１０３と、ログ内部分行列サイズ１１０４と、を含む。ログ内タイムスタンプは、ログを生成した時刻、つまり部分行列サイズが決定された時刻を示す。ログ内行列Ａサイズは、部分行列のサイズ決定に使用した行列Ａのサイズを示す。ログ内行列Ｂサイズは、部分行列のサイズ決定に使用した行列Ｂのサイズを示す。ログ内部分行列サイズは、この時点で決定した部分行列サイズを示す。 FIG. 11 is an explanatory diagram showing an example of control processing of the log output unit 225. The log data 210 includes an in-log time stamp 1101, an in-log matrix A size 1102, an in-log matrix B size 1103, and an in-log submatrix size 1104. The in-log time stamp indicates the time at which the log was generated, that is, the time at which the submatrix size was determined. The in-log matrix A size indicates the size of matrix A used to determine the submatrix size. The in-log matrix B size indicates the size of matrix B used to determine the submatrix size. The in-log submatrix size indicates the submatrix size determined at that point.

また、ログ出力部２２５は、ログ生成部１１０５と、ログ一時保存メモリ１１０６と、を有する。ログ生成部１１０５は、部分行列形状決定回路２１５から必要な情報として、たとえば、行列Ａサイズ、行列Ｂサイズ、および部分行列サイズを取得し、演算装置２００から取得する時刻と共に現在のログデータ２１０を生成し、ログ一時保存メモリ１１０６に転送する。また、ログ生成部１１０５は、ログデータ２１０をメモリ２０２に転送するためのＤＭＡＣ２２４の起動を行う。ログ一時保存メモリ１１０６は、ログ生成部１１０５が生成したログデータ２１０をＤＭＡＣ２２４が読み出すまで保存する。 Further, the log output unit 225 has a log generation unit 1105 and a log temporary storage memory 1106. The log generation unit 1105 acquires, for example, the matrix A size, the matrix B size, and the submatrix size as necessary information from the submatrix shape determination circuit 215, and obtains the current log data 210 together with the time acquired from the arithmetic unit 200. Generate and transfer to the log temporary storage memory 1106. Further, the log generation unit 1105 activates the DMAC 224 for transferring the log data 210 to the memory 202. The log temporary storage memory 1106 stores the log data 210 generated by the log generation unit 1105 until the DMAC224 reads it.

ログ出力部２２５は以下のように動作する。部分行列形状決定回路２１５が部分行列の形状を決定したタイミングで、ログ生成部１１０５が、時刻、行列Ａサイズ、行列Ｂサイズ、および部分行列サイズからなるログデータ２１０を生成し、ログ一時保存メモリ１１０６に保存する。同時に、ログ生成部１１０５は、ＤＭＡＣ２２４を起動し、一時保存されたログデータ２１０をメモリ２０２に転送する。これにより、行列演算実行毎に決定される部分行列サイズがどのような行列Ａおよび行列Ｂのサイズから生成されたかが記録される。 The log output unit 225 operates as follows. At the timing when the submatrix shape determination circuit 215 determines the shape of the submatrix, the log generation unit 1105 generates log data 210 composed of time, matrix A size, matrix B size, and submatrix size, and temporarily stores the log. Store in 1106. At the same time, the log generation unit 1105 activates the DMAC 224 and transfers the temporarily saved log data 210 to the memory 202. As a result, it is recorded from what size of matrix A and matrix B the submatrix size determined for each matrix operation execution is generated.
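The four fields of the log data 210 can be pictured as a simple record, as in the Python sketch below. The class, field, and function names are hypothetical, and representing each size as a (rows, columns) tuple and the time stamp as seconds since the epoch are assumptions made only for illustration; the text does not specify the encoding.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    timestamp: float        # in-log time stamp 1101
    matrix_a_size: tuple    # in-log matrix A size 1102, assumed (rows, cols)
    matrix_b_size: tuple    # in-log matrix B size 1103, assumed (rows, cols)
    submatrix_size: tuple   # in-log submatrix size 1104, assumed (pROW, pCOL)


def make_log_record(a_size, b_size, sub_size, now=None):
    """Build one record at the moment the submatrix shape is determined."""
    stamp = time.time() if now is None else now
    return LogRecord(stamp, a_size, b_size, sub_size)
```

One such record per matrix operation is enough to reconstruct which matrix sizes led to which submatrix size.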

以上のように構成することで、オフロードデバイス上で、最適な部分行列の形状を使用した行列演算が実行され、様々なサイズの行列の乗算を高速に行うことが可能となる。つまり同じ回路構成でより高速に、またより電力効率良く演算を行うことが可能となる。 With the above configuration, matrix operations using the optimum submatrix shape are executed on the offload device, and matrix multiplication of various sizes can be performed at high speed. That is, it is possible to perform calculations at higher speed and with higher power efficiency with the same circuit configuration.

また、上述した演算装置２００は、下記（１）〜（１１）のように構成することもできる。 Further, the above-mentioned arithmetic unit 200 can also be configured as described in (1) to (11) below.

（１）行列データ（行列Ａデータ２０７、行列Ｂデータ２０８）を記憶するＤＲＡＭ２０５にアクセス可能な演算装置２００は、部分行列形状決定回路２１５と、行列演算アレイ２２１と、を有する。 (1) The arithmetic apparatus 200 that can access the DRAM 205 that stores the matrix data (matrix A data 207, matrix B data 208) has a partial matrix shape determination circuit 215 and a matrix arithmetic array 221.

部分行列形状決定回路２１５は、行列データの分割単位となる部分行列の行方向の列数である部分列サイズを増加させ、増加後の部分列サイズが複数の乗算器７０６の個数以上となった場合、増加前の部分列サイズと複数の乗算器７０６の個数（アレイサイズ）とに基づいて、部分行列の列方向の行数である部分行サイズを求め、部分列サイズ（ｐＣＯＬ）と部分行サイズ（ｐＲＯＷ）とにより構成される部分行列の形状（ｐＣＯＬ，ｐＲＯＷ）を決定する。 The submatrix shape determination circuit 215 increases the partial column size, which is the number of columns in the row direction of the submatrix serving as the unit of division of the matrix data. When the increased partial column size becomes equal to or greater than the number of multipliers 706, the circuit obtains the partial row size, which is the number of rows in the column direction of the submatrix, based on the partial column size before the increase and the number of multipliers 706 (the array size), and determines the shape (pCOL, pROW) of the submatrix formed by the partial column size (pCOL) and the partial row size (pROW).

The matrix operation array 221 has the plurality of multipliers 706 and a plurality of adders 707. It executes a product-sum operation on a first submatrix (for example, A00) having the shape determined by the submatrix shape determination circuit 215 within the matrix A data 207 stored in the DRAM 205, and a second submatrix (for example, B00) having the shape determined by the submatrix shape determination circuit 215 within the matrix B data 208 stored in the DRAM 205.

As a result, the matrix operation array 221 can set, according to the shape of the submatrix, the ratio between the multipliers 706 and adders 707 allocated in the row direction and those allocated in the column direction, which speeds up the matrix operation.
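The shape decision in (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit itself: the text does not say how the subcolumn size is increased or how the partial row size is derived, so the sketch assumes that pCOL is doubled at each step and that pROW is the array size divided by pCOL (integer division). The function name `decide_shape` is hypothetical.

```python
def decide_shape(array_size, start=1):
    """Return (pCOL, pROW) for an array of `array_size` multipliers.

    Assumptions (not stated in the text): pCOL doubles at each step, and
    pROW = array_size // pCOL, so pCOL * pROW never exceeds array_size.
    """
    if array_size < 1 or not 1 <= start <= array_size:
        raise ValueError("require 1 <= start <= array_size")
    p_col = start
    while True:
        increased = p_col * 2            # assumed growth rule
        if increased >= array_size:      # increased size reached the multiplier count
            # Use the size *before* the increase, as the text describes.
            return p_col, max(1, array_size // p_col)
        p_col = increased


shape = decide_shape(64)
```

Under these assumptions, 64 multipliers end up allocated as 32 along the row direction and 2 along the column direction.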

(2) In the arithmetic unit 200 of (1) above, the submatrix shape determination circuit 215 adopts, as the shape of the submatrix, the shape corresponding to the larger of the subcolumn size of the first submatrix (for example, A00) determined for the matrix A data 207 and the partial row size of the second submatrix (for example, B00) determined for the matrix B data 208 (S308, S309).

As a result, a common submatrix shape applicable to both the matrix A data 207 and the matrix B data 208 can be obtained.

(3) In the arithmetic unit 200 of (1) above, the submatrix shape determination circuit 215 determines the row-direction size (horizontal H) of a region that covers the matrix data with a plurality of submatrices, based on the subcolumn size and the column size, which is the number of columns in the row direction of the matrix data. It also determines the column-direction size (vertical V) of that region based on the partial row size and the row size, which is the number of rows in the column direction of the matrix data.

As a result, the matrix data can be covered by submatrices as fully as possible, and the number of matrix operations can be reduced.
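The sizes H and V in (3) can be illustrated with a short sketch. It assumes that H and V are counted in submatrices and are obtained by ceiling division; the text says only that they are determined from the subcolumn size and column size, and from the partial row size and row size. The function names are hypothetical.

```python
def covering_region(col_size, row_size, p_col, p_row):
    """Return (H, V): how many submatrices are needed across and down.

    Assumption: H = ceil(col_size / p_col) and V = ceil(row_size / p_row).
    """
    if min(col_size, row_size, p_col, p_row) < 1:
        raise ValueError("all sizes must be positive")
    h = -(-col_size // p_col)   # ceiling division without floats
    v = -(-row_size // p_row)
    return h, v


def padded_elements(col_size, row_size, p_col, p_row):
    """Count the elements of the covering region that lie outside the matrix data."""
    h, v = covering_region(col_size, row_size, p_col, p_row)
    return (h * p_col) * (v * p_row) - col_size * row_size


h, v = covering_region(100, 30, 32, 2)
extra = padded_elements(100, 30, 32, 2)
```

For a matrix with 100 columns and 30 rows and a (32, 2) submatrix, the region is 4 submatrices wide and 15 high, and 840 elements of the region fall outside the matrix data.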

(4) The arithmetic unit 200 of (3) above has an addition control circuit 216 that assigns the value 0 to the range of the region to which no submatrix is assigned, and the matrix operation array 221 executes the product-sum operation after the addition control circuit 216 has assigned the 0 values. This suppresses any adverse effect on the matrix operation.
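Why assigning 0 values leaves the result intact can be shown with a small software sketch. It is not the addition control circuit 216 itself; the helper names `zero_pad` and `matmul` and the example matrices are assumptions for illustration. Padding both operands with zeros up to a multiple of the submatrix shape adds only zero-valued products to each sum.

```python
def zero_pad(matrix, p_row, p_col):
    """Pad a rectangular list-of-lists matrix with zeros to multiples of (p_row, p_col)."""
    rows, cols = len(matrix), len(matrix[0])
    padded_rows = -(-rows // p_row) * p_row
    padded_cols = -(-cols // p_col) * p_col
    padded = [[0] * padded_cols for _ in range(padded_rows)]
    for r in range(rows):
        padded[r][:cols] = matrix[r]    # copy the real data; the rest stays 0
    return padded


def matmul(a, b):
    """Plain product-sum reference implementation."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


a = [[1, 2, 3], [4, 5, 6]]            # 2 x 3
b = [[7, 8], [9, 10], [11, 12]]       # 3 x 2
pa = zero_pad(a, p_row=2, p_col=4)    # 2 x 4, one zero column added
pb = zero_pad(b, p_row=4, p_col=2)    # 4 x 2, one zero row added
product = matmul(pa, pb)
```

The padded product equals the unpadded product because every added term multiplies by 0.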

(5) The arithmetic unit 200 of (3) above has a matrix AB size register 214 in which the column size and the row size are set each time matrix data is input, and the submatrix shape determination circuit 215 determines the shape of the submatrix, the row-direction size (horizontal H) of the region, and the column-direction size (vertical V) of the region each time matrix data is input.

As a result, the plurality of multipliers 706 and the plurality of adders 707 can be allocated optimally for each set of input matrix data according to its shape, so that matrix operations are executed dynamically and efficiently.

(6) In the arithmetic unit 200 of (1) above, the submatrix shape determination circuit 215 obtains the number of cache hits of the sense amplifier based on the cache size of the sense amplifier of the DRAM 205, the column size, which is the number of columns in the row direction of the matrix data, and the subcolumn size. When the number of cache hits is not equal to or greater than the threshold value (the value of the threshold register 213), the circuit increases the subcolumn size (S405, S406: No, S403).

The access time of the DRAM 205 varies with conditions. Data in the same row held in the sense amplifier cache can be accessed at high speed, whereas accessing data in a different row requires writing back the data currently in the sense amplifier cache and reading out the target row, so the access time can be up to 10 times longer. Because the access time differs greatly depending on the access pattern, calculating the number of cache hits as a lower bound on the submatrix shape allows the shape to be changed according to the sense amplifier cache size.

(7) In the arithmetic unit 200 of (6) above, when the number of cache hits is equal to or greater than the threshold value, the submatrix shape determination circuit 215 obtains the partial row size, which is the number of rows in the column direction of the submatrix, based on the subcolumn size before the increase (pCOL) and the number of the plurality of multipliers 706 (the array size), and tentatively determines the shape of the submatrix from the subcolumn size (pCOL) and the partial row size (pROW) (steps S407, S408). The circuit then increases the tentatively determined subcolumn size. When the increased subcolumn size becomes equal to or greater than the number of the plurality of multipliers 706 (the array size), the circuit obtains the partial row size based on the tentatively determined subcolumn size (pCOL) and the number of the plurality of multipliers 706 (the array size), and determines the shape of the submatrix from the subcolumn size and the partial row size (steps S415, S416).

As a result, until the subcolumn size becomes equal to or greater than the array size, the submatrix shape determination circuit 215 can search for a subcolumn size for which the number of cache hits is equal to or greater than the threshold and the number of accesses to the DRAM 205 is minimized. The number of matrix operations can therefore be reduced.

(8) In the arithmetic unit 200 of (7) above, the submatrix shape determination circuit 215 tentatively determines the row-direction size (horizontal H) of a region that covers the matrix data with a plurality of submatrices, based on the subcolumn size and the column size, which is the number of columns in the row direction of the matrix data, and tentatively determines the column-direction size (vertical V) of that region based on the partial row size and the row size, which is the number of rows in the column direction of the matrix data. When the product N of the two tentatively determined sizes (horizontal H and vertical V) is equal to or less than the minimum number of accesses to the matrix data, the circuit updates the minimum number of accesses to N (steps S409 to S414).

As a result, until the subcolumn size becomes equal to or greater than the array size, the submatrix shape determination circuit 215 can search for a region size (horizontal H, vertical V) for which the number of cache hits is equal to or greater than the threshold and the number of accesses to the DRAM 205 is minimized. The area that must be filled with 0 relative to the size of the matrix data can therefore be reduced, and the efficiency of reading from the DRAM 205 can be improved.
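The search described in (6) to (8) can be sketched as a loop. This is a simplified illustration under several assumptions that are not taken from the text: pCOL is doubled at each step, pROW is the array size divided by pCOL, H and V are ceiling divisions, and the cache-hit count is modeled by the hypothetical function `assumed_cache_hits`. The actual hit calculation of step S405 is not reproduced here.

```python
def ceil_div(n, d):
    return -(-n // d)


def assumed_cache_hits(cache_size, col_size, p_col):
    """Hypothetical hit model: consecutive elements of one submatrix row
    that can be served from a single open DRAM row."""
    return min(p_col, col_size, cache_size)


def search_shape(col_size, row_size, array_size, cache_size, threshold,
                 cache_hits=assumed_cache_hits):
    """Return ((pCOL, pROW, H, V), N) with the smallest N = H * V, or (None, None)."""
    best, min_accesses = None, None
    p_col = 1
    while p_col < array_size:                        # stop once pCOL reaches the array size
        if cache_hits(cache_size, col_size, p_col) >= threshold:
            p_row = max(1, array_size // p_col)      # assumed derivation of pROW
            h = ceil_div(col_size, p_col)
            v = ceil_div(row_size, p_row)
            n = h * v                                # multiplication result N
            if min_accesses is None or n <= min_accesses:
                best, min_accesses = (p_col, p_row, h, v), n
        p_col *= 2                                   # assumed growth rule
    return best, min_accesses


result = search_shape(col_size=100, row_size=30, array_size=64,
                      cache_size=1024, threshold=8)
```

With these example values, candidates with pCOL below 8 fail the threshold, and among pCOL = 8, 16, and 32 the products N are 52, 56, and 60, so the (8, 8) shape is kept.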

(9) The arithmetic unit 200 of (1) above is the FPGA 204. This reduces the load on the processor 201.

(10) The arithmetic unit 200 of (1) above has the FPGA 204, which includes the submatrix shape determination circuit 215 and the matrix operation array 221; the processor 201, which executes a control program 206 for controlling the FPGA 204; and the memory 202, which stores the control program 206 and the matrix operation result (result C data) produced by the matrix operation array 221. This makes it possible to control the FPGA 204.

(11) In the arithmetic unit 200 of (10) above, the FPGA 204 includes the log generation unit 1105, which generates log data 210 indicating what the submatrix shape determination circuit 215 executed, and the memory 202 stores the log data 210 generated by the log generation unit 1105. This makes it possible to check what the submatrix shape determination circuit 215 executed.

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above are described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations having all of the described elements. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to that of one embodiment. For part of the configuration of each embodiment, other configurations can be added, deleted, or substituted.

The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines required in an implementation are necessarily shown. In practice, almost all components may be considered to be interconnected.

200 Arithmetic unit
201 Processor
202 Memory
205 DRAM
206 Control program
207 Matrix A data
208 Matrix B data
209 Result C data
210 Log data
211 Array size register
212 Sense amplifier cache size register
213 Threshold register
214 Matrix AB size register
215 Submatrix shape determination circuit
216 Addition control circuit
221 Matrix operation array
225 Log output unit
706 Multiplier
707 Adder

Claims

An arithmetic unit that can access a storage device that stores matrix data.
A determination unit that increases the subcolumn size, which is the number of columns in the row direction of a submatrix serving as the division unit of the matrix data, and that, when the subcolumn size after the increase becomes equal to or greater than the number of a plurality of multipliers, obtains the partial row size, which is the number of rows in the column direction of the submatrix, based on the subcolumn size before the increase and the number of the plurality of multipliers, and determines the shape of the submatrix formed by the subcolumn size and the partial row size, and
A matrix calculation unit that has the plurality of multipliers and a plurality of adders and executes a product-sum operation on a first submatrix having the shape determined by the determination unit within the first matrix data stored in the storage device and a second submatrix having the shape determined by the determination unit within the second matrix data stored in the storage device, and
An arithmetic unit characterized by having.

The arithmetic unit according to claim 1.
The determination unit adopts, as the shape of the submatrix, the shape corresponding to the larger of the subcolumn size of the first submatrix determined for the first matrix data and the partial row size of the second submatrix determined for the second matrix data,
An arithmetic unit characterized by that.

The arithmetic unit according to claim 1.
The determination unit determines the row-direction size of a region that covers the matrix data with a plurality of the submatrices, based on the subcolumn size and the column size, which is the number of columns in the row direction of the matrix data, and determines the column-direction size of the region based on the partial row size and the row size, which is the number of rows in the column direction of the matrix data,
An arithmetic unit characterized by that.

The arithmetic unit according to claim 3.
It has a control unit that assigns a 0 value to a range of the area to which the submatrix is not assigned.
The matrix calculation unit executes the product-sum operation after the zero value is assigned by the control unit.
An arithmetic unit characterized by that.

The arithmetic unit according to claim 3.
It has a setting unit for setting the column size and the row size each time the matrix data is input.
Each time the matrix data is input, the determination unit determines the shape of the submatrix, the size of the region in the row direction, and the size of the region in the column direction.
An arithmetic unit characterized by that.

The arithmetic unit according to claim 1.
The determination unit obtains the number of cache hits of the sense amplifier based on the cache size of the sense amplifier of the storage device, the column size, which is the number of columns in the row direction of the matrix data, and the subcolumn size, and increases the subcolumn size when the number of cache hits is not equal to or greater than the threshold value.
An arithmetic unit characterized by that.

The arithmetic unit according to claim 6.
When the number of cache hits is equal to or greater than the threshold value, the determination unit obtains the partial row size, which is the number of rows in the column direction of the submatrix, based on the subcolumn size before the increase and the number of the plurality of multipliers, and tentatively determines the shape of the submatrix from the subcolumn size and the partial row size.
The determination unit increases the tentatively determined subcolumn size, and when the increased subcolumn size becomes equal to or greater than the number of the plurality of multipliers, obtains the partial row size, which is the number of rows in the column direction of the submatrix, based on the tentatively determined subcolumn size and the number of the plurality of multipliers, and determines the shape of the submatrix from the subcolumn size and the partial row size.
An arithmetic unit characterized by that.

The arithmetic unit according to claim 7.
The determination unit tentatively determines the row-direction size of a region that covers the matrix data with a plurality of the submatrices, based on the subcolumn size and the column size, which is the number of columns in the row direction of the matrix data, tentatively determines the column-direction size of the region based on the partial row size and the row size, which is the number of rows in the column direction of the matrix data, and, when the product of the two tentatively determined sizes is equal to or less than the minimum number of accesses to the matrix data, updates the minimum number of accesses to that product.
An arithmetic unit characterized by that.

The arithmetic unit according to claim 1.
The arithmetic unit is an offload device.
An arithmetic unit characterized by that.

The arithmetic unit according to claim 1.
An offload device including the determination unit and the matrix calculation unit, and
A processor that executes a control program for controlling the offload device, and
A memory for storing the control program and the matrix calculation result by the matrix calculation unit, and
An arithmetic unit characterized by having.

The arithmetic unit according to claim 10.
The offload device includes a generation unit that generates log data indicating the execution content of the determination unit.
The memory stores log data generated by the generation unit.
An arithmetic unit characterized by that.

An arithmetic method performed by an arithmetic unit that has access to a storage device that stores matrix data.
The arithmetic unit
Increases the subcolumn size, which is the number of columns in the row direction of a submatrix serving as the division unit of the matrix data, and, when the subcolumn size after the increase becomes equal to or greater than the number of a plurality of multipliers, obtains the partial row size, which is the number of rows in the column direction of the submatrix, based on the subcolumn size before the increase and the number of the plurality of multipliers, and determines the shape of the submatrix formed by the subcolumn size and the partial row size,
Has the plurality of multipliers and a plurality of adders, and executes a product-sum operation on a first submatrix having the determined shape within the first matrix data stored in the storage device and a second submatrix having the determined shape within the second matrix data stored in the storage device.
A calculation method characterized by that.