JP7241397B2

JP7241397B2 - Arithmetic Device, Arithmetic Method, and Arithmetic Program

Info

Publication number: JP7241397B2
Application number: JP2019107283A
Authority: JP
Inventors: 淳一郎牧野; 俊一戎崎
Original assignee: RIKEN Institute of Physical and Chemical Research
Current assignee: RIKEN Institute of Physical and Chemical Research
Priority date: 2019-06-07
Filing date: 2019-06-07
Publication date: 2023-03-17
Anticipated expiration: 2039-06-07
Also published as: WO2020246598A1; JP2020201659A

Description

The present invention relates to an arithmetic device, an arithmetic method, and an arithmetic program.

In various applications such as numerical computation and deep learning, matrix-matrix products (hereinafter "matrix products") and matrix-vector products account for most of the computation. Arithmetic devices and arithmetic methods that execute such matrix operations efficiently have therefore been developed (see Patent Documents 1 to 3). Processors capable of executing matrix operations have also been developed.
[Prior Art Documents]
[Patent Documents]
[Patent Document 1] International Publication No. 2018/207926
[Patent Document 2] Japanese Unexamined Patent Application Publication No. 2018-139045
[Patent Document 3] Japanese Unexamined Patent Application Publication No. 2018-197906

The matrix-vector product of an n-dimensional square matrix and an n-dimensional vector involves n^2 multiplications and approximately n^2 additions, for a total of approximately 2n^2 operations. Accordingly, when the n-dimensional square matrix is fixed, the amount of computation of the matrix-vector product is of order n^2 for each input n-dimensional vector. Enlarging the matrix size, and with it the matrix operator, therefore reduces the ratio of data loaded to operations performed. However, enlarging the matrix operator makes the load/store capability of the register file and similar components relatively low, so the processing performance for small matrices and for non-matrix operations becomes relatively low.
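The trade-off described above can be made concrete with a small calculation. The following is a minimal sketch, assuming the matrix stays resident in the operator so that only the n elements of the input vector count as loaded data; the function names `matvec_op_count` and `load_to_ops_ratio` are illustrative and do not appear in the source.

```python
def matvec_op_count(n: int) -> int:
    """Exact operation count for an n x n matrix times an n-element vector."""
    multiplications = n * n
    additions = n * (n - 1)  # each of the n outputs sums n products
    return multiplications + additions


def load_to_ops_ratio(n: int) -> float:
    """Loaded vector elements per arithmetic operation (matrix assumed resident)."""
    return n / matvec_op_count(n)


for n in (4, 8, 16, 32):
    print(n, matvec_op_count(n), round(load_to_ops_ratio(n), 4))
```

Under this assumption the ratio is n / (2n^2 - n), roughly 1/(2n), so doubling n roughly halves the data loaded per operation.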

To solve the above problem, a first aspect of the present invention provides an arithmetic device. The arithmetic device may include a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. The arithmetic device may include a matrix storage unit that stores at least a first submatrix, by which the first partial vector is to be multiplied, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is multiplied. The arithmetic device may include a pipeline operation unit capable of executing, by pipeline operation, an operation that adds an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit. The arithmetic device may include an operation control unit that, while the pipeline operation unit is performing the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, instructs the pipeline operation unit to execute another matrix-vector product operation that uses the first partial vector or the first submatrix.

The vector storage unit may further store a second partial vector among the first plurality of partial vectors. The matrix storage unit may further store a second submatrix, by which the second partial vector is to be multiplied, among the first plurality of submatrices. The operation control unit may instruct the pipeline operation unit to execute, in or after the cycle in which the result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, an operation that adds the matrix-vector product of the second submatrix and the second partial vector to that result.

The vector storage unit may further store a third partial vector, which is to be multiplied by the first submatrix, among a second plurality of partial vectors obtained by dividing a second vector that is to be multiplied by the first matrix. During the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the operation control unit may instruct the pipeline operation unit to execute, as the other matrix-vector product operation, the matrix-vector product of the first submatrix and the third partial vector.

The first vector and the second vector may be column vectors included in a second matrix that is to be multiplied by the first matrix.

The vector storage unit may store a plurality of second vectors included in the second matrix. The operation control unit may fill each cycle, from after the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until just before its result becomes available without delay, with a matrix-vector product operation of the first submatrix and the third partial vector taken from each of the plurality of second vectors.

The matrix storage unit may further store a third submatrix, by which the first partial vector is to be multiplied, among the first plurality of submatrices. During the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the operation control unit may instruct the pipeline operation unit to execute, as the other matrix-vector product operation, the matrix-vector product of the third submatrix and the first partial vector.

The matrix storage unit may store a plurality of third submatrices. The operation control unit may fill each cycle, from after the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until just before its result becomes available without delay, with a matrix-vector product operation of each of the plurality of third submatrices and the first partial vector.

A second aspect of the present invention provides an arithmetic method. The arithmetic method may include a vector storage unit storing at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. The arithmetic method may include a matrix storage unit storing at least a first submatrix, by which the first partial vector is to be multiplied, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is multiplied. The arithmetic method may include a pipeline operation unit, which is capable of executing by pipeline operation an operation that adds an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, starting execution of another matrix-vector product operation that uses the first partial vector or the first submatrix during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector.

A third aspect of the present invention provides an arithmetic program executed by an arithmetic device. The arithmetic device may include a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. The arithmetic device may include a matrix storage unit that stores at least a first submatrix, by which the first partial vector is to be multiplied, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is multiplied. The arithmetic device may include a pipeline operation unit capable of executing, by pipeline operation, an operation that adds an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit. The arithmetic program may cause the arithmetic device to start execution of another matrix-vector product operation that uses the first partial vector or the first submatrix during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector.

Note that the above summary of the invention does not enumerate all of the necessary features of the present invention. Subcombinations of these feature groups may also constitute inventions.

FIG. 1 shows an example of a matrix operation according to the present embodiment.
FIG. 2 shows an example of formulas in which the matrix operation according to the present embodiment is decomposed into matrix-vector products of submatrices and partial vectors.
FIG. 3 shows the configuration of arithmetic device 300 according to the present embodiment.
FIG. 4 shows a first example of pipeline processing by arithmetic device 300 according to the present embodiment.
FIG. 5 shows a second example of pipeline processing by arithmetic device 300 according to the present embodiment.
FIG. 6 shows a third example of pipeline processing by arithmetic device 300 according to the present embodiment.
FIG. 7 shows a fourth example of pipeline processing by arithmetic device 300 according to the present embodiment.
FIG. 8 shows an example of computer 2200 in which aspects of the present invention may be embodied in whole or in part.

The present invention is described below through embodiments of the invention, but the following embodiments do not limit the invention according to the claims. Moreover, not all combinations of features described in the embodiments are necessarily essential to the solution provided by the invention.

FIG. 1 shows an example of a matrix operation according to the present embodiment. The figure shows the matrix operation C = A × B, which computes the matrix product of matrix A and matrix B and assigns it to matrix C. Matrices A, B, and C are square matrices with 8 rows and 8 columns.

a_ij (i = 1, 2, ..., 8; j = 1, 2, ..., 8) are the components (also referred to as "elements") of matrix A. b_ij (i = 1, 2, ..., 8; j = 1, 2, ..., 8) are the components of matrix B. c_ij (i = 1, 2, ..., 8; j = 1, 2, ..., 8) are the components of matrix C. For j ≥ 3, the individual components b_ij and c_ij are omitted from the figure.

The column vector of column j of matrix B, that is, the vector whose components are the components b_ij (i = 1, 2, ..., 8) of column j of matrix B, is denoted vector vbj, and the column vector of column j of matrix C is denoted vector vcj. That is, vbj = (b_1j, b_2j, ..., b_8j)^T and vcj = (c_1j, c_2j, ..., c_8j)^T. Vector vcj can then be calculated as the matrix-vector product vcj = A × vbj of matrix A and vector vbj.

Here, when using an arithmetic device that can execute, for example, the matrix-vector product of a 4-row, 4-column matrix and a 4-element vector as one unit of operation, the matrix operation C = A × B is carried out by dividing it into units that the arithmetic device can compute at one time. In the figure, the submatrices obtained by dividing matrix A into two in each of the row direction and the column direction are denoted A11, A21, A12, and A22. Submatrix Amn (m = 1, 2; n = 1, 2) consists of the components of matrix A in the m-th row range of the row-direction division and the n-th column range of the column-direction division. The partial vectors obtained by dividing vector vbj into two in the row direction are denoted vb1j and vb2j. Partial vector vbmj (m = 1, 2) consists of the components of vector vbj in the m-th row range of the row-direction division. Likewise, the partial vectors obtained by dividing vector vcj into two in the row direction are denoted vc1j and vc2j. Partial vector vcmj (m = 1, 2) consists of the components of vector vcj in the m-th row range of the row-direction division.

In this figure, a matrix product is shown as an example of a matrix operation, and the matrix-vector product has been described as forming part of the matrix product. A matrix-vector product that is not part of a matrix product is handled in the same way as, for example, the matrix-vector product vc1 = A × vb1 for the first columns of matrices B and C. In the present embodiment, the case is illustrated in which matrices A, B, and C have a power-of-two number of elements in the row direction and the column direction, and matrix A is divided into a power-of-two number of parts in the row direction and the column direction. Alternatively, matrices A, B, and C may have a number of elements other than a power of two in at least one of the row direction and the column direction, and matrix A may be divided into a number of parts other than a power of two in at least one of the row direction and the column direction (for example, 3×3, 5×5, 9×9, 3×5, or 5×9).

FIG. 2 shows an example of formulas in which the matrix operation according to the present embodiment is decomposed into matrix-vector products of submatrices and partial vectors. The matrix-vector product vcj = A × vbj of matrix A and vector vbj can be split into vc1j = (A11 A12) × vbj = A11 × vb1j + A12 × vb2j, which computes partial vector vc1j, and vc2j = (A21 A22) × vbj = A21 × vb1j + A22 × vb2j, which computes partial vector vc2j. That is, for j = 1, vc11 = A11 × vb11 + A12 × vb21 and vc21 = A21 × vb11 + A22 × vb21. For j = 2, vc12 = A11 × vb12 + A12 × vb22 and vc22 = A21 × vb12 + A22 × vb22. The same applies to j = 3, ..., 8.
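The decomposition can be checked numerically. The following is a minimal sketch in plain Python, assuming arbitrary integer test data; the helper names `matvec`, `vec_add`, and `block` are illustrative and are not part of the source.

```python
def matvec(M, v):
    """Matrix-vector product of a list-of-rows matrix M and a vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def vec_add(x, y):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(x, y)]


def block(M, m, n, size=4):
    """Return submatrix Amn of M (1-based block indices, size x size blocks)."""
    r0, c0 = (m - 1) * size, (n - 1) * size
    return [row[c0:c0 + size] for row in M[r0:r0 + size]]


# Arbitrary 8x8 matrix A and 8-element column vector vb (assumed test values).
A = [[(3 * i + 5 * j) % 7 for j in range(8)] for i in range(8)]
vb = [(2 * i + 1) % 5 for i in range(8)]
vb1, vb2 = vb[:4], vb[4:]

# vc1 = A11 x vb1 + A12 x vb2, vc2 = A21 x vb1 + A22 x vb2
vc1 = vec_add(matvec(block(A, 1, 1), vb1), matvec(block(A, 1, 2), vb2))
vc2 = vec_add(matvec(block(A, 2, 1), vb1), matvec(block(A, 2, 2), vb2))

print(vc1 + vc2 == matvec(A, vb))  # True
```

Concatenating vc1 and vc2 reproduces the undivided product, which is the property the formulas of FIG. 2 rely on.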

Thus, when matrix A is divided into d parts in each of the row direction and the column direction and vector vb is divided into d parts, the matrix-vector product of matrix A and vector vb comprises d × d matrix-vector products of submatrices and partial vectors. If the arithmetic device has only registers capable of holding a single submatrix, it must perform the matrix operation shown in FIG. 2 while sequentially loading submatrices from memory into the registers, and processing performance degrades.
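The d × d count can be illustrated with a short sketch. It assumes the dimension of A is divisible by d and simply counts one unit product per submatrix visited; the function name `blocked_matvec` is hypothetical.

```python
def blocked_matvec(A, vb, d):
    """Compute A x vb as d*d submatrix/partial-vector products.

    Returns the result vector and the number of unit products issued.
    Assumes len(A) is divisible by d.
    """
    n = len(A)
    s = n // d                      # rows/columns per block
    vc = [0] * n
    unit_products = 0
    for m in range(d):              # row block of A
        for k in range(d):          # column block of A, row block of vb
            for i in range(s):
                row = A[m * s + i]
                vc[m * s + i] += sum(
                    row[k * s + j] * vb[k * s + j] for j in range(s)
                )
            unit_products += 1      # one submatrix x partial vector product
    return vc, unit_products


A = [[(i * 8 + j) % 5 for j in range(8)] for i in range(8)]
vb = [1, 2, 3, 4, 5, 6, 7, 8]
direct = [sum(a * v for a, v in zip(row, vb)) for row in A]

for d in (2, 4):
    vc, count = blocked_matvec(A, vb, d)
    print(d, count, vc == direct)  # prints "2 4 True" then "4 16 True"
```

With storage for only one submatrix, each of these unit products would also imply a submatrix load from memory, which is the overhead the paragraph above describes.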

FIG. 3 shows the configuration of arithmetic device 300 according to the present embodiment. Arithmetic device 300 can execute, by pipeline operation and as one unit of operation, the matrix-vector product of a matrix with up to the number of rows and columns defined by its specification and a vector with up to the number of rows defined by its specification. Arithmetic device 300 calculates the matrix-vector product of a matrix and a vector larger than the size that can be processed in one unit of operation by dividing it into a plurality of matrix-vector products of submatrices and partial vectors, each of which can be processed in one unit of operation.

Arithmetic device 300 includes vector storage unit 310, matrix storage unit 320, pipeline operation unit 330, result storage unit 340, operation control unit 350, main memory 360, and memory control unit 370. Vector storage unit 310 stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. In the present embodiment, vector storage unit 310 is, as an example, a register. Alternatively, vector storage unit 310 may be another storage device, such as a cache memory, that can supply partial vectors to pipeline operation unit 330 in a pipelined manner.

Here, the first vector is the target vector to be multiplied by the first matrix, at least one submatrix of which is stored in matrix storage unit 320. The first vector is larger than the size that arithmetic device 300 can process in one unit of operation. The first plurality of partial vectors are obtained by dividing the first vector into sizes that can be processed in one unit of operation. In the matrix operation of FIG. 1, the first vector corresponds to one of the vectors vbj (for example, vb1). The first plurality of partial vectors corresponds to the partial vectors vbij (i = 1, 2) obtained by dividing the first vector vbj. If the first vector is larger still, it may be divided into three or more partial vectors.

Vector storage unit 310 may also have sufficient storage area to further store a second partial vector and other partial vectors among the first plurality of partial vectors. For example, in the matrix operation of FIG. 1, vector storage unit 310 may store partial vector vb1j and partial vector vb2j.

Matrix storage unit 320 stores at least a first submatrix, by which the first partial vector is to be multiplied, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is multiplied. In the present embodiment, matrix storage unit 320 is, as an example, a register. Alternatively, matrix storage unit 320 may be another storage device, such as a cache memory, that can supply submatrices to pipeline operation unit 330 in a pipelined manner.

Here, the first matrix is the target matrix by which the first vector, at least one partial vector of which is stored in vector storage unit 310, is multiplied. The first matrix is larger than the size that arithmetic device 300 can process in one unit of operation. The first plurality of submatrices are obtained by dividing the first matrix into sizes that arithmetic device 300 can process in one unit of operation. In the matrix operation of FIG. 1, the first matrix corresponds to matrix A. The first plurality of submatrices corresponds to the submatrices Aij (i = 1, 2; j = 1, 2) obtained by dividing the first matrix A in the row direction and the column direction. If the first matrix is larger still, it may be divided into three or more parts in each of the row direction and the column direction.

Matrix storage unit 320 may also have sufficient storage area to further store a second submatrix, by which the second partial vector is to be multiplied, and other submatrices among the first plurality of submatrices. For example, in the matrix operation of FIG. 1, matrix storage unit 320 may store submatrix A11, by which partial vector vb1j is to be multiplied, and submatrix A12, by which partial vector vb2j is to be multiplied. Here, the first partial vector and the second partial vector lie in different row ranges of the first vector. The first submatrix and the second submatrix therefore lie in different column ranges of the target matrix. The first submatrix and the second submatrix may lie in the same row range of the target matrix.

Pipeline operation unit 330 is connected to vector storage unit 310 and matrix storage unit 320. It receives the partial vector to be operated on from vector storage unit 310 and the submatrix to be operated on from matrix storage unit 320. By pipeline operation, pipeline operation unit 330 can execute an operation that adds an intermediate vector to the matrix-vector product of the submatrix and the partial vector to be operated on. In the present embodiment, pipeline operation unit 330 can execute, as one unit of operation, an operation that calculates the matrix-vector product of a 4-row, 4-column submatrix and a 4-row partial vector and adds a 4-row intermediate vector to this matrix-vector product to produce the partial vector that is the operation result (also referred to as the "result vector").
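Functionally, one unit of operation is a small multiply-accumulate on vectors. The sketch below models only the arithmetic, not the pipelining; the name `unit_op` and the constant `UNIT` are assumptions made for illustration.

```python
UNIT = 4  # assumed unit size, matching the 4x4 example in the text


def unit_op(sub_matrix, partial_vector, intermediate):
    """Return sub_matrix x partial_vector + intermediate (the result vector)."""
    assert len(sub_matrix) == UNIT and all(len(r) == UNIT for r in sub_matrix)
    assert len(partial_vector) == UNIT and len(intermediate) == UNIT
    return [
        sum(a * v for a, v in zip(row, partial_vector)) + acc
        for row, acc in zip(sub_matrix, intermediate)
    ]


identity = [[1 if i == j else 0 for j in range(UNIT)] for i in range(UNIT)]
doubling = [[2 if i == j else 0 for j in range(UNIT)] for i in range(UNIT)]
zero = [0] * UNIT

# Chain two unit operations: the first result becomes the intermediate
# vector of the second, as in vc11 = A11 x vb11 + A12 x vb21.
first = unit_op(identity, [1, 2, 3, 4], zero)
second = unit_op(doubling, [10, 20, 30, 40], first)
print(second)  # [21, 42, 63, 84]
```

Passing a zero intermediate vector gives a plain matrix-vector product, so the same unit covers both the first and the accumulating step.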

Here, "executable as one unit of operation" means that pipeline operation unit 330, in response to a request such as an external instruction or the execution of an instruction, executes as a whole the operation that adds an intermediate vector to the matrix-vector product of the submatrix and the partial vector to be operated on, and outputs the result. Pipeline operation unit 330 may have a large number of arithmetic units so that every basic operation included in this operation (for example, multiplication and addition of values) is performed by a separate arithmetic unit; alternatively, some of the operations may be performed by the same arithmetic unit.

That pipeline operation unit 330 performs a pipeline operation means that, once an operation starts, pipeline operation unit 330 outputs the result after processing in a plurality of stages, and that these stages can operate in parallel. That is, in each cycle between the start of one operation and the output of its result, pipeline operation unit 330 can start other operations one after another, provided nothing prevents their execution.

For example, pipeline operation unit 330 may input the submatrix and the partial vector in the first cycle; multiply corresponding elements of the submatrix and the partial vector in the second cycle; sum, in the third cycle, the products calculated in the second cycle for each element to be included in the result vector; and output the partial vector of the operation result in the fourth cycle. Pipeline operation unit 330 may have a pipeline structure with any number of stages, as needed.
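The four-cycle example can be tabulated to show how stages overlap. This is a minimal sketch, assuming independent operations issued one per cycle and one cycle per stage; the stage names and the function `occupancy` are illustrative.

```python
STAGES = ("input", "multiply", "sum", "output")


def occupancy(num_ops):
    """Return {cycle: {stage: op_index}} for independent ops issued one per cycle."""
    table = {}
    for op in range(num_ops):
        for s, stage in enumerate(STAGES):
            cycle = op + s + 1      # op number `op` enters "input" in cycle op + 1
            table.setdefault(cycle, {})[stage] = op
    return table


table = occupancy(3)
for cycle in sorted(table):
    print(cycle, table[cycle])
```

In this model three independent operations complete in 3 + 4 - 1 = 6 cycles, and in cycle 4 three different stages are busy with three different operations.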

Result storage unit 340 is connected to pipeline operation unit 330. Result storage unit 340 receives and stores the result vector output by pipeline operation unit 330. Result vectors are, for example, vc11 and vc21 in FIG. 2. In the present embodiment, result storage unit 340 is, as an example, a register. Alternatively, result storage unit 340 may be another storage device, such as a cache memory, that can store partial vectors from pipeline operation unit 330 in a pipelined manner. Vector storage unit 310, matrix storage unit 320, and result storage unit 340 may be implemented as a single storage device.

Operation control unit 350 is connected to vector storage unit 310, matrix storage unit 320, pipeline operation unit 330, and result storage unit 340. In response to a request to execute a matrix operation, such as receiving an instruction from outside arithmetic device 300 or decoding a matrix operation instruction during program execution in arithmetic device 300, operation control unit 350 controls vector storage unit 310, matrix storage unit 320, pipeline operation unit 330, and result storage unit 340 so as to execute the requested matrix operation.

Main memory 360 stores the matrices subject to matrix operations and the operation results. Memory control unit 370 is connected between main memory 360 and vector storage unit 310, matrix storage unit 320, and result storage unit 340. In response to an external instruction or a memory access instruction during program execution in arithmetic device 300, memory control unit 370 transfers data between main memory 360 and vector storage unit 310, matrix storage unit 320, and result storage unit 340.

For example, in response to a request for a vector load from main memory 360 to vector storage unit 310, memory control unit 370 reads from main memory 360 the partial vector designated by the vector load, among the partial vectors stored in main memory 360, and stores it in vector storage unit 310. In response to a request for a matrix load from main memory 360 to matrix storage unit 320, memory control unit 370 reads from main memory 360 the submatrix designated by the matrix load, among the submatrices stored in main memory 360, and stores it in matrix storage unit 320. In response to a request for a matrix or vector store from result storage unit 340 to main memory 360, memory control unit 370 reads the matrix or vector of operation results stored in result storage unit 340 and stores it in main memory 360. Depending on the design of arithmetic device 300, main memory 360 need not be provided in addition to vector storage unit 310 and matrix storage unit 320; instead, a relatively large memory that functions as vector storage unit 310 and matrix storage unit 320 may be provided so that partial vectors and submatrices can be supplied from that memory directly to pipeline operation unit 330 in a pipelined manner.

In the configuration described above, the pipeline operation unit 330 executes matrix-vector product operations by pipeline processing. For example, when performing the operation vc11 = A11 × vb11 + A12 × vb21 shown in FIG. 2, the pipeline operation unit 330 requires multiple cycles from starting the operation on the first submatrix A11 and the first partial vector vb11 until the result is obtained. Consequently, suppose that a second operation, which adds the matrix-vector product of the second submatrix A12 and the second partial vector vb21 to the result of the first operation, is issued in the cycle immediately following the cycle in which the first operation (the matrix-vector product of A11 and vb11) was started. The second operation then cannot proceed (a pipeline hazard) and must be made to wait until the result of the first operation becomes available.

Depending on the pipeline design, the wait for the second operation can be reduced to some extent, for example by supplying the result of the first operation to the second operation without waiting for it to be written to a register (bypassing or forwarding). However, as long as the second operation depends on the first, it is difficult to completely eliminate the bubbles that pipeline hazards create in the pipeline of the pipeline operation unit 330.

The operation control unit 350 therefore instructs the pipeline operation unit 330 to execute another matrix-vector product operation that uses the first partial vector or the first submatrix while the pipeline operation unit 330 is performing the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector (for example, A11 × vb11). Here, the "other matrix-vector product" is an operation that does not use the result of the matrix-vector product of the first submatrix and the first partial vector. It may be an operation that includes a matrix-vector product, for example an operation that adds to a matrix-vector product a result other than that of the matrix-vector product of the first submatrix and the first partial vector. In this way, during the period in which the pipeline operation unit 330 would otherwise wait for the result of the first operation before starting the second operation, the operation control unit 350 can issue to the pipeline operation unit 330 one or more other matrix-vector products that do not depend on the result of the first operation, and can thereby increase the utilization efficiency of the pipeline operation unit 330.

Furthermore, the operation control unit 350 may instruct the pipeline operation unit 330 to execute the operation that adds the matrix-vector product of the second submatrix and the second partial vector to the result of the matrix-vector product of the first submatrix and the first partial vector in or after the cycle in which that result becomes available without delay. This prevents a pipeline hazard from occurring in the second operation and makes it possible to issue other matrix-vector product operations between the first operation and the second operation.
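
The effect of this reordering can be illustrated with a toy in-order issue model. The sketch below is not taken from the embodiment: the function name `issue_cycles`, the one-operation-per-cycle issue rule, and the latency of 2 cycles between issuing an operation and its result becoming usable are all illustrative assumptions.

```python
def issue_cycles(ops, latency):
    """Return the cycle in which each op issues in a simple in-order pipeline.

    ops: list of (dest, src) pairs; src is None when the op depends on nothing.
    latency: assumed number of cycles from issue until the result is usable.
    """
    ready = {}    # register name -> first cycle in which its value is usable
    cycle = 0     # earliest cycle in which the next op could issue
    issued = []
    for dest, src in ops:
        if src is not None:
            cycle = max(cycle, ready[src])  # stall until the source is ready
        issued.append(cycle)
        ready[dest] = cycle + latency
        cycle += 1
    return issued

# Dependent operations issued back to back: each accumulation has to stall.
naive = [("vctmp1", None), ("vc11", "vctmp1"),
         ("vctmp2", None), ("vc12", "vctmp2")]
# The independent product A11 x vb12 is issued between the first and second operations.
interleaved = [("vctmp1", None), ("vctmp2", None),
               ("vc11", "vctmp1"), ("vc12", "vctmp2")]

naive_cycles = issue_cycles(naive, latency=2)
interleaved_cycles = issue_cycles(interleaved, latency=2)
print(naive_cycles, interleaved_cycles)
```

Under these assumptions the back-to-back order needs six cycles to issue four operations, while the interleaved order issues one operation in every cycle.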

FIG. 4 shows a first example of pipeline processing by the arithmetic device 300 according to the present embodiment. In the operation labeled cycle 0, the operation control unit 350 instructs the vector storage unit 310 to read vb11, an example of the first partial vector, and instructs the matrix storage unit 320 to read A11, an example of the first submatrix. It also instructs the pipeline operation unit 330 to compute the matrix-vector product of the first submatrix A11 and the first partial vector vb11 and to store it, as an intermediate vector of the ongoing computation, in an intermediate register (temporary register) vctmp1 of the pipeline operation unit 330.

By the start of cycle 1, the vector storage unit 310 further stores, in addition to vb11 (an example of the first partial vector) and vb21 (an example of the second partial vector), a third partial vector (vb12, as an example) that is to be multiplied by the first submatrix A11. The third partial vector is one of a second plurality of partial vectors vbi2 obtained by dividing a second vector (vb2, as an example) that is to be multiplied by the first matrix A. In this example, the first vector and the second vector are column vectors of a second matrix B that is to be multiplied by the first matrix A; for example, the first vector is vb1 and the second vector is vb2. Alternatively, the first vector and the second vector may be separate vectors that are each to be multiplied by the matrix A.

In the operation labeled cycle 1, while the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector is in progress, the operation control unit 350 instructs the vector storage unit 310 to read the third partial vector vb12 and instructs the matrix storage unit 320 to read the first submatrix A11. It also instructs the pipeline operation unit 330 to execute the matrix-vector product of the first submatrix and the third partial vector as another matrix-vector product operation that does not cause a pipeline hazard. In response, the pipeline operation unit 330 computes the matrix-vector product of the first submatrix and the third partial vector and stores it, as an intermediate vector of the ongoing computation, in an intermediate register vctmp2 of the pipeline operation unit 330. This operation corresponds to the first matrix-vector product in the third row of FIG. 2. The matrix-vector products of cycles 0 and 1 contribute to different result vectors, vc11 and vc12. Because there is no dependency between these operations, the pipeline operation unit 330 can execute them without causing a pipeline hazard.

In the operation labeled cycle 2, the operation control unit 350 instructs the vector storage unit 310 to read vb21, an example of the second partial vector, and instructs the matrix storage unit 320 to read A12, an example of the second submatrix. It also instructs the pipeline operation unit 330 to compute the matrix-vector product of the second submatrix A12 and the second partial vector vb21 and add to it the result vctmp1 of the operation of cycle 0, and instructs the result storage unit 340 to store the resulting partial vector vc11. The operation of cycle 2 depends on the operation of cycle 0. By inserting the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2, the operation control unit 350 can raise the utilization efficiency of the pipeline of the pipeline operation unit 330.

By the start of cycle 3, the vector storage unit 310 may further store a fourth partial vector, out of the second plurality of partial vectors, that is to be multiplied by the second submatrix A12. In the operation labeled cycle 3, the operation control unit 350 instructs the vector storage unit 310 to read vb22, an example of the fourth partial vector, and instructs the matrix storage unit 320 to read A12, an example of the second submatrix. It also instructs the pipeline operation unit 330 to compute the matrix-vector product of the second submatrix A12 and the fourth partial vector vb22 and add to it the result vctmp2 of the operation of cycle 1, and instructs the main memory 360 to store the resulting partial vector vc12. The operation of cycle 3 depends on the operation of cycle 1. By inserting the operation of cycle 2, which does not depend on the operation of cycle 1, between the operations of cycle 1 and cycle 3, the operation control unit 350 can raise the utilization efficiency of the pipeline of the pipeline operation unit 330.

In the example of this figure, cycles 0 to 3 compute the partial vectors (vc11, vc12) in the first row range (rows 1 to 4) of the column vectors (vc1, vc2) of matrix C, and cycles 4 to 7 compute the partial vectors (vc21, vc22) in the second row range (rows 5 to 8) of those column vectors. The operations of cycles 4 to 7 are the same as those of cycles 0 to 3 except that submatrices A21 and A22 are used in place of submatrices A11 and A12 and partial vectors vc21 and vc22 are produced in place of partial vectors vc11 and vc12, so their description is omitted.
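
The eight cycles of the first example can be written out as a small schedule table and checked numerically. This is only a sketch under stated assumptions: the 4 × 4 submatrix size follows the row ranges mentioned above, while the helper `run_schedule`, the tuple layout of the schedule, the random test data, and the sequential (one operation at a time) execution model are illustrative and not part of the embodiment.

```python
import numpy as np

N = 4  # assumed submatrix size: the 8x8 matrix A is split into 2x2 blocks of 4x4

def run_schedule(A, B, schedule):
    """Execute one matrix-vector product per cycle and return the register contents."""
    regs = {}
    for dest, (i, k), j, acc in schedule:
        a_blk = A[i * N:(i + 1) * N, k * N:(k + 1) * N]  # submatrix A(i+1)(k+1)
        v_blk = B[k * N:(k + 1) * N, j]                  # partial vector vb(k+1)(j+1)
        regs[dest] = a_blk @ v_blk + (regs[acc] if acc else 0)
    return regs

# (destination, block index of A, column of B, register to accumulate or None)
fig4_schedule = [
    ("vctmp1", (0, 0), 0, None),      # cycle 0: A11 x vb11
    ("vctmp2", (0, 0), 1, None),      # cycle 1: A11 x vb12
    ("vc11",   (0, 1), 0, "vctmp1"),  # cycle 2: A12 x vb21 + vctmp1
    ("vc12",   (0, 1), 1, "vctmp2"),  # cycle 3: A12 x vb22 + vctmp2
    ("vctmp1", (1, 0), 0, None),      # cycle 4: A21 x vb11
    ("vctmp2", (1, 0), 1, None),      # cycle 5: A21 x vb12
    ("vc21",   (1, 1), 0, "vctmp1"),  # cycle 6: A22 x vb21 + vctmp1
    ("vc22",   (1, 1), 1, "vctmp2"),  # cycle 7: A22 x vb22 + vctmp2
]

rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(8, 8))
B = rng.integers(-3, 4, size=(8, 2))
regs = run_schedule(A, B, fig4_schedule)

# Assemble C from the four result partial vectors and compare with A @ B.
C = np.block([[regs["vc11"][:, None], regs["vc12"][:, None]],
              [regs["vc21"][:, None], regs["vc22"][:, None]]])
print(np.array_equal(C, A @ B))
```

In this table every accumulating operation sits two cycles after the operation that produced its temporary, which is the spacing the first example relies on.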

In this example, the operation control unit 350 inserts, between the first operation (the matrix-vector product of the first submatrix and the first partial vector) and the second operation that uses its result, another matrix-vector product operation that uses the first submatrix, namely, in this example, the matrix-vector product of the first submatrix and the third partial vector. The operation control unit 350 can thereby make use of one idle cycle that would otherwise be required between the first operation and the second operation.

When multiple idle cycles arise between the first operation and the second operation, the operation control unit 350 may insert, between the first operation and the second operation, the matrix-vector products of the first submatrix with each of a plurality of third partial vectors. For example, the vector storage unit 310 further stores a plurality of second vectors vb2, vb3, … included in the second matrix B. The operation control unit 350 fills each cycle, from after the start of the pipeline operation of the matrix-vector product of the first submatrix A11 and the first partial vector vb11 until just before its result becomes available without delay, with the matrix-vector products A11 × vb12, A11 × vb13, …, where vb12, vb13, … are the third partial vectors taken from the second vectors vb2, vb3, … respectively. The first vector and the plurality of second vectors may be arranged in the column order of the second matrix or in the reverse of that order. They also need not follow the column order of the second matrix and may each be the column vector of an arbitrary column.
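
A short sketch of this filling strategy follows. It assumes, for illustration only, that one operation issues per cycle, so that interleaving `num_cols` columns places each accumulation `num_cols` cycles after the operation that produced its temporary. The function names and the tuple layout are hypothetical.

```python
def interleave_columns(num_cols):
    """Issue order for the first block row when num_cols columns are interleaved.

    First pass:  A11 x vb1j        -> temporary j
    Second pass: A12 x vb2j + tmp j -> vc1j
    Each entry is (submatrix name, column j, accumulates_temporary).
    """
    first = [("A11", j, False) for j in range(1, num_cols + 1)]
    second = [("A12", j, True) for j in range(1, num_cols + 1)]
    return first + second

def dependency_gaps(order):
    """Cycles between each accumulating op and the op that produced its temporary."""
    produced = {}
    gaps = []
    for cycle, (_, col, accumulates) in enumerate(order):
        if accumulates:
            gaps.append(cycle - produced[col])
        else:
            produced[col] = cycle
    return gaps

order = interleave_columns(4)
gaps = dependency_gaps(order)
print(gaps)
```

If the result of a product becomes usable d cycles after issue, choosing `num_cols` of at least d would, in this simplified model, leave no idle cycle between the first and second operations.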

FIG. 5 shows a second example of pipeline processing by the arithmetic device 300 according to the present embodiment. When the pipeline operation unit 330 has more intermediate registers, or when operation results become usable after being stored temporarily in the main memory 360, the operation control unit 350 may perform control so that the operations of cycles 4 and 5 in FIG. 4 are executed before the operations of cycles 2 and 3. In this case, while the pipeline operation of the first operation (the matrix-vector product of the first submatrix A11 and the first partial vector vb11) is in progress, the operation control unit 350 causes the pipeline operation unit 330 to execute the operation of cycle 1 in FIG. 5, which is another matrix-vector product using the first submatrix, and the operation of cycle 2 in FIG. 5, which is another matrix-vector product using the first partial vector. The operation control unit 350 may also cause the operation of cycle 3, which is the matrix-vector product of the submatrix A21 used in the operation of cycle 2 and the partial vector vb12 used in the operation of cycle 1, to be executed between the first operation and the second operation. This allows the operation control unit 350 to fill further idle cycles between the first operation and the second operation. The execution order among the operations of cycles 0 to 3 may be arbitrary, and the execution order among the operations of cycles 4 to 7 may be determined according to the execution order of the corresponding operations in cycles 0 to 3.
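
The trade-off between the first and second examples can be made concrete by counting how many partial sums are outstanding at once. The sketch below is an illustration only: each operation is identified by hypothetical zero-based indices (i, j, k) meaning "submatrix A(i+1)(k+1) times partial vector vb(k+1)(j+1)", where k = 0 produces a partial sum for result block (i, j) and k = 1 consumes it.

```python
def max_live_temporaries(order):
    """Peak number of partial sums that have been produced but not yet consumed."""
    live = set()
    peak = 0
    for i, j, k in order:
        if k == 0:
            live.add((i, j))        # a new partial sum must be held somewhere
            peak = max(peak, len(live))
        else:
            live.discard((i, j))    # the accumulation consumes the partial sum
    return peak

# First example (FIG. 4): A11vb11, A11vb12, A12vb21, A12vb22, then block row 2.
fig4_order = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1),
              (1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]
# Second example (FIG. 5): cycles 4 and 5 of FIG. 4 are moved before cycles 2 and 3.
fig5_order = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0),
              (0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]

fig4_peak = max_live_temporaries(fig4_order)
fig5_peak = max_live_temporaries(fig5_order)
print(fig4_peak, fig5_peak)
```

In this model the second ordering doubles the distance between producer and consumer at the cost of holding four partial sums instead of two, which is consistent with the requirement for more intermediate registers stated above.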

FIG. 6 shows a third example of pipeline processing by the arithmetic device 300 according to the present embodiment. In the operation labeled cycle 0, the operation control unit 350 performs the same control as in cycle 0 of FIG. 4.

By the start of cycle 1, the matrix storage unit 320 further stores, in addition to A11 (an example of the first submatrix) and A12 (an example of the second submatrix), a third submatrix (A21, as an example) by which the first partial vector vb11 is to be multiplied, out of a first plurality of submatrices Aij obtained by dividing the first matrix A in the row and column directions. Alternatively, the third submatrix may be a submatrix included in a matrix other than the first matrix A.

In the operation labeled cycle 1, while the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector is in progress, the operation control unit 350 instructs the vector storage unit 310 to read the first partial vector vb11 and instructs the matrix storage unit 320 to read the third submatrix A21. It also instructs the pipeline operation unit 330 to execute the matrix-vector product of the third submatrix A21 and the first partial vector vb11 as another matrix-vector product operation that does not cause a pipeline hazard. In response, the pipeline operation unit 330 computes the matrix-vector product of the third submatrix A21 and the first partial vector vb11 and stores it, as an intermediate vector of the ongoing computation, in an intermediate register vctmp2 of the pipeline operation unit 330. This operation corresponds to the first matrix-vector product in the second row of FIG. 2. The matrix-vector products of cycles 0 and 1 contribute to different result vectors, vc11 and vc21. Because there is no dependency between these operations, the pipeline operation unit 330 can execute them without causing a pipeline hazard.

In the operation labeled cycle 2, the operation control unit 350 performs the same control as in cycle 2 of FIG. 4. The operation of cycle 2 depends on the operation of cycle 0. By inserting the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2, the operation control unit 350 can raise the utilization efficiency of the pipeline of the pipeline operation unit 330.

By the start of cycle 3, the matrix storage unit 320 may further store a fourth submatrix, out of the first plurality of submatrices Aij, by which the second partial vector vb21 is to be multiplied. In the operation labeled cycle 3, the operation control unit 350 instructs the vector storage unit 310 to read vb21, an example of the second partial vector, and instructs the matrix storage unit 320 to read A22, an example of the fourth submatrix. It also instructs the pipeline operation unit 330 to compute the matrix-vector product of the fourth submatrix A22 and the second partial vector vb21 and add to it the result vctmp2 of the operation of cycle 1, and instructs the main memory 360 to store the resulting partial vector vc21. The operation of cycle 3 depends on the operation of cycle 1. By inserting the operation of cycle 2, which does not depend on the operation of cycle 1, between the operations of cycle 1 and cycle 3, the operation control unit 350 can raise the utilization efficiency of the pipeline of the pipeline operation unit 330.

In the example of this figure, cycles 0 to 3 compute the two partial vectors vc11 and vc21 contained in one column vector vc1 of matrix C, and cycles 4 to 7 compute the two partial vectors vc12 and vc22 contained in another column vector vc2 of matrix C. The operations of cycles 4 to 7 are the same as those of cycles 0 to 3 except that partial vectors vb12 and vb22 are used in place of partial vectors vb11 and vb21 and partial vectors vc12 and vc22 are produced in place of partial vectors vc11 and vc21, so their description is omitted.
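
One way to read the third example is as a blocked matrix product whose loops run over the columns of B outermost, then the block columns of A, then the block rows of A. The sketch below rests on that interpretation, which is an assumption rather than something the description states; the function name, the 4 × 4 block size, and the test data are illustrative.

```python
import numpy as np

N = 4  # assumed block size

def matmul_fig6_order(A, B):
    """Blocked C = A @ B that also records the order in which products are issued."""
    block_rows, block_cols = A.shape[0] // N, A.shape[1] // N
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.result_type(A, B))
    issued = []
    for j in range(B.shape[1]):           # column vector vb(j+1) of B
        for k in range(block_cols):       # block column of A, block of the vector
            for i in range(block_rows):   # block row of A
                a_blk = A[i * N:(i + 1) * N, k * N:(k + 1) * N]
                v_blk = B[k * N:(k + 1) * N, j]
                C[i * N:(i + 1) * N, j] += a_blk @ v_blk
                issued.append((f"A{i + 1}{k + 1}", f"vb{k + 1}{j + 1}"))
    return C, issued

rng = np.random.default_rng(1)
A = rng.integers(-3, 4, size=(8, 8))
B = rng.integers(-3, 4, size=(8, 2))
C, issued = matmul_fig6_order(A, B)
print(issued[:4])
```

The first four recorded products are A11 × vb11, A21 × vb11, A12 × vb21, and A22 × vb21, matching cycles 0 to 3 as described. Swapping the loop order to block row, block column, column of B would give the order of the first example instead.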

In this example, the operation control unit 350 inserts, between the first operation (the matrix-vector product of the first submatrix and the first partial vector) and the second operation that uses its result, another matrix-vector product operation that uses the first partial vector, namely, in this example, the matrix-vector product of the third submatrix and the first partial vector. The operation control unit 350 can thereby make use of one idle cycle that would otherwise be required between the first operation and the second operation.

When multiple idle cycles arise between the first operation and the second operation, the operation control unit 350 may insert, between the first operation and the second operation, the matrix-vector products of each of a plurality of third submatrices with the first partial vector. For example, the matrix storage unit 320 stores a plurality of third submatrices A21, A31, … that are included in the first matrix A and by which the first partial vector is to be multiplied. The operation control unit 350 fills each cycle, from after the start of the pipeline operation of the matrix-vector product of the first submatrix A11 and the first partial vector vb11 until just before its result becomes available without delay, with the matrix-vector products of each of the third submatrices A21, A31, … and the first partial vector vb11. The first submatrix and the plurality of third submatrices may be arranged in column order or in reverse column order within the same row range of the first matrix. They also need not be arranged in the column order of the second matrix and may each be a submatrix of an arbitrary column range.

FIG. 7 shows a fourth example of pipeline processing by the arithmetic device 300 according to the present embodiment. When the pipeline operation unit 330 has more intermediate registers, or when operation results become usable after being stored temporarily in the main memory 360, the operation control unit 350 may perform control so that the operations of cycles 4 and 5 in FIG. 6 are executed before the operations of cycles 2 and 3. In this case, while the pipeline operation of the first operation (the matrix-vector product of the first submatrix A11 and the first partial vector vb11) is in progress, the operation control unit 350 causes the pipeline operation unit 330 to execute the operation of cycle 1 in FIG. 7, which is another matrix-vector product using the first partial vector, and the operation of cycle 2 in FIG. 7, which is another matrix-vector product using the first submatrix. The operation control unit 350 may also cause the operation of cycle 3, which is the matrix-vector product of the third submatrix A21 used in the operation of cycle 1 and the partial vector vb12 used in the operation of cycle 2, to be executed between the first operation and the second operation. This allows the operation control unit 350 to fill further idle cycles between the first operation and the second operation. The execution order among the operations of cycles 0 to 3 may be arbitrary, and the execution order among the operations of cycles 4 to 7 may be determined according to the execution order of the corresponding operations in cycles 0 to 3. The pipeline processing of FIG. 7 is substantially the same as the pipeline processing of FIG. 5 with the operations of cycles 1 and 2 interchanged and the operations of cycles 5 and 6 interchanged.

In any pipeline processing, including the first to fourth examples described above, the operation control unit 350 may instruct the memory control unit 370 to transfer the partial vectors and submatrices used by the pipeline operation unit 330 from the main memory 360 to the vector storage unit 310 and the matrix storage unit 320 before the pipeline operation unit 330 needs them. For example, in the example of FIG. 4, the main memory 360 may, before cycle 0, transfer partial vectors vb11, vb12, vb21, and vb22 to the vector storage unit 310 and submatrices A11 and A12 to the matrix storage unit 320. Alternatively, the main memory 360 may, before cycle 0, transfer partial vectors vb11 and vb12 to the vector storage unit 310 and submatrix A11 to the matrix storage unit 320, and, before cycle 2, transfer partial vectors vb21 and vb22 to the vector storage unit 310 and submatrix A12 to the matrix storage unit 320.

In the pipeline processing of the first and second examples, the pipeline operation unit 330 uses a different partial vector (vb11, vb12, vb21, vb22) in each cycle, but uses a new submatrix (A11, A12, A21, A22) only once every two cycles. The matrix storage unit 320 therefore only needs enough throughput to output one submatrix every two cycles, which allows the power consumption and circuit scale of the matrix storage unit 320 to be reduced.

In the pipeline processing of the third and fourth examples, the pipeline operation unit 330 uses a different submatrix (A11, A12, A21, A22) in each cycle, but uses a new partial vector (vb11, vb12, vb21, vb22) only once every two cycles. The vector storage unit 310 therefore only needs enough throughput to output one partial vector every two cycles, which allows the power consumption and circuit scale of the vector storage unit 310 to be reduced.
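
The difference in required read throughput can be checked by counting, for the first and third examples, the cycles in which each storage unit has to deliver an operand different from the one used in the previous cycle. This is a simplified sketch: it assumes an operand that is unchanged from the previous cycle does not need to be read again, and the function names are hypothetical.

```python
def operand_fetches(seq):
    """Count the cycles in which the operand differs from the previous cycle's operand."""
    return sum(1 for prev, cur in zip([None] + seq, seq) if cur != prev)

def fetch_counts(schedule):
    """Return (submatrix fetches, partial-vector fetches) for a list of (matrix, vector)."""
    matrices = [m for m, _ in schedule]
    vectors = [v for _, v in schedule]
    return operand_fetches(matrices), operand_fetches(vectors)

# First example (FIG. 4): the submatrix is reused in two consecutive cycles.
fig4 = [("A11", "vb11"), ("A11", "vb12"), ("A12", "vb21"), ("A12", "vb22"),
        ("A21", "vb11"), ("A21", "vb12"), ("A22", "vb21"), ("A22", "vb22")]
# Third example (FIG. 6): the partial vector is reused in two consecutive cycles.
fig6 = [("A11", "vb11"), ("A21", "vb11"), ("A12", "vb21"), ("A22", "vb21"),
        ("A11", "vb12"), ("A21", "vb12"), ("A12", "vb22"), ("A22", "vb22")]

fig4_counts = fetch_counts(fig4)
fig6_counts = fetch_counts(fig6)
print(fig4_counts, fig6_counts)
```

Over eight cycles, the first example needs four submatrix reads and eight partial-vector reads, while the third example needs eight submatrix reads and four partial-vector reads.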

A designer of the arithmetic device 300, or a user of the arithmetic device 300, may select the execution order of the pipeline processing so that the circuit scale of the arithmetic device 300 can be made smaller or so that the power consumption of the arithmetic device 300 can be made smaller.

Various embodiments of the present invention may be described with reference to flowcharts and block diagrams, in which a block may represent (1) a step of a process in which an operation is performed or (2) a section of an apparatus responsible for performing the operation. Particular steps and sections may be implemented by dedicated circuitry, by programmable circuitry supplied with computer-readable instructions stored on a computer-readable medium, or by a processor supplied with computer-readable instructions stored on a computer-readable medium. Dedicated circuitry may include digital or analog hardware circuits, and may include integrated circuits (ICs) or discrete circuits. Programmable circuitry may include reconfigurable hardware circuits, including logical AND, logical OR, logical XOR, logical NAND, logical NOR, and other logical operations, and memory elements such as flip-flops, registers, field-programmable gate arrays (FPGAs), and programmable logic arrays (PLAs).

A computer-readable medium may include any tangible device capable of storing instructions to be executed by a suitable device. As a result, a computer-readable medium having instructions stored thereon constitutes a product that includes instructions that can be executed to create means for performing the operations specified in the flowcharts or block diagrams. Examples of computer-readable media may include electronic storage media, magnetic storage media, optical storage media, electromagnetic storage media, and semiconductor storage media. More specific examples of computer-readable media may include floppy (registered trademark) disks, diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), electrically erasable programmable read-only memory (EEPROM), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), Blu-ray (RTM) discs, memory sticks, and integrated circuit cards.

Computer-readable instructions may include any of assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, JAVA (registered trademark), and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.

Computer-readable instructions may be provided to a processor or programmable circuitry of a general-purpose computer, a special-purpose computer, or another programmable data processing device, either locally or via a local area network (LAN) or a wide area network (WAN) such as the Internet, and the computer-readable instructions may be executed to create means for performing the operations specified in the flowcharts or block diagrams. Examples of processors include computer processors, processing units, microprocessors, digital signal processors, controllers, microcontrollers, and the like.

FIG. 8 shows an example of a computer 2200 in which aspects of the present invention may be embodied in whole or in part. A program installed on the computer 2200 can cause the computer 2200 to function as a device according to an embodiment of the present invention or as one or more sections of that device, or to perform operations associated with that device or those sections, and can cause the computer 2200 to perform a process according to an embodiment of the present invention or stages of that process. Such a program may be executed by the CPU 2212 to cause the computer 2200 to perform particular operations associated with some or all of the blocks of the flowcharts and block diagrams described herein.

The computer 2200 according to this embodiment includes a CPU 2212, a RAM 2214, a graphics controller 2216, and a display device 2218, which are interconnected by a host controller 2210. The computer 2200 also includes input/output units such as a communication interface 2222, a hard disk drive 2224, a DVD-ROM drive 2226, and an IC card drive, which are connected to the host controller 2210 via an input/output controller 2220. The computer further includes legacy input/output units such as a ROM 2230 and a keyboard 2242, which are connected to the input/output controller 2220 via an input/output chip 2240.

The CPU 2212 operates in accordance with programs stored in the ROM 2230 and the RAM 2214, thereby controlling each unit. The graphics controller 2216 obtains image data generated by the CPU 2212 in a frame buffer or the like provided in the RAM 2214, or in the graphics controller 2216 itself, and causes the image data to be displayed on the display device 2218.

The communication interface 2222 communicates with other electronic devices via a network. The hard disk drive 2224 stores programs and data used by the CPU 2212 in the computer 2200. The DVD-ROM drive 2226 reads programs or data from the DVD-ROM 2201 and provides the programs or data to the hard disk drive 2224 via the RAM 2214. The IC card drive reads programs and data from an IC card and writes programs and data to the IC card.

The ROM 2230 stores therein a boot program or the like that is executed by the computer 2200 at activation, and/or a program that depends on the hardware of the computer 2200. The input/output chip 2240 may also connect various input/output units to the input/output controller 2220 via a parallel port, a serial port, a keyboard port, a mouse port, or the like.

A program is provided by a computer-readable medium such as the DVD-ROM 2201 or an IC card. The program is read from the computer-readable medium, installed in the hard disk drive 2224, the RAM 2214, or the ROM 2230, which are also examples of computer-readable media, and executed by the CPU 2212. The information processing described in these programs is read by the computer 2200 and brings about cooperation between the programs and the various types of hardware resources described above. A device or method may be constituted by realizing the operation or processing of information in accordance with the use of the computer 2200.

For example, when communication is performed between the computer 2200 and an external device, the CPU 2212 may execute a communication program loaded into the RAM 2214 and instruct the communication interface 2222 to perform communication processing based on the processing described in the communication program. Under the control of the CPU 2212, the communication interface 2222 reads transmission data stored in a transmission buffer area provided in a recording medium such as the RAM 2214, the hard disk drive 2224, the DVD-ROM 2201, or an IC card, and transmits the read transmission data to the network, or writes reception data received from the network to a reception buffer area or the like provided on the recording medium.

The CPU 2212 may also cause all or a necessary portion of a file or database stored in an external recording medium, such as the hard disk drive 2224, the DVD-ROM drive 2226 (DVD-ROM 2201), or an IC card, to be read into the RAM 2214, and may perform various types of processing on the data in the RAM 2214. The CPU 2212 then writes the processed data back to the external recording medium.

Various types of information, such as various types of programs, data, tables, and databases, may be stored in a recording medium and subjected to information processing. On data read from the RAM 2214, the CPU 2212 may perform various types of processing described throughout this disclosure and specified by the instruction sequence of a program, including various types of operations, information processing, condition determination, conditional branching, unconditional branching, and information search and replacement, and writes the results back to the RAM 2214. The CPU 2212 may also search for information in files, databases, and the like in the recording medium. For example, when a plurality of entries, each having an attribute value of a first attribute associated with an attribute value of a second attribute, are stored in the recording medium, the CPU 2212 may search the plurality of entries for an entry that matches a condition specifying the attribute value of the first attribute, read the attribute value of the second attribute stored in that entry, and thereby obtain the attribute value of the second attribute associated with the first attribute that satisfies the predetermined condition.
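The attribute lookup described above can be sketched as follows. This is only a minimal illustration, assuming the entries are held in memory as (first attribute, second attribute) pairs; the function name `lookup_second_attribute`, the predicate argument, and the sample data are hypothetical and do not appear in the specification.

```python
def lookup_second_attribute(entries, condition):
    """Return the second-attribute values of all matching entries.

    entries   -- iterable of (first_attribute, second_attribute) pairs
                 (an assumed in-memory stand-in for the recording medium)
    condition -- predicate applied to the first attribute value
    """
    return [second for first, second in entries if condition(first)]


# Hypothetical sample entries: (first attribute, second attribute).
sample_entries = [("matrix", 320), ("vector", 310), ("matrix", 330)]

# Obtain the second attribute of every entry whose first attribute is "matrix".
matches = lookup_second_attribute(sample_entries, lambda first: first == "matrix")
```

With the sample data above, `matches` holds the second-attribute values of the two entries whose first attribute equals "matrix", in their stored order.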

The programs or software modules described above may be stored in a computer-readable medium on or near the computer 2200. In addition, a recording medium such as a hard disk or RAM provided in a server system connected to a dedicated communication network or the Internet can be used as the computer-readable medium, whereby the program is provided to the computer 2200 via the network.

Although the present invention has been described above using embodiments, the technical scope of the present invention is not limited to the scope described in those embodiments. It is apparent to those skilled in the art that various modifications or improvements can be made to the above embodiments. It is clear from the claims that forms incorporating such modifications or improvements can also fall within the technical scope of the present invention.

It should be noted that the processes, such as operations, procedures, steps, and stages, in the devices, systems, programs, and methods shown in the claims, the specification, and the drawings may be executed in any order, unless the order is explicitly indicated by expressions such as "before" or "prior to", or unless the output of an earlier process is used in a later process. Even if an operation flow in the claims, the specification, or the drawings is described using "first", "next", and the like for convenience, this does not mean that the operations must be carried out in that order.

300 arithmetic device
310 vector storage unit
320 matrix storage unit
330 pipeline operation unit
340 result storage unit
350 operation control unit
360 main memory
370 memory control unit
2200 computer
2201 DVD-ROM
2210 host controller
2212 CPU
2214 RAM
2216 graphics controller
2218 display device
2220 input/output controller
2222 communication interface
2224 hard disk drive
2226 DVD-ROM drive
2230 ROM
2240 input/output chip
2242 keyboard

Claims

1. An arithmetic device comprising:
a vector storage unit that stores at least a first partial vector among a plurality of first partial vectors obtained by dividing a first vector;
a matrix storage unit that stores at least a first submatrix by which the first partial vector is to be multiplied, among a plurality of first submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is to be multiplied;
a pipeline operation unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit; and
an operation control unit that instructs the pipeline operation unit to execute, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, another matrix-vector product operation using the first partial vector or the first submatrix.

2. The arithmetic device according to claim 1, wherein
the vector storage unit further stores a second partial vector among the plurality of first partial vectors,
the matrix storage unit further stores a second submatrix by which the second partial vector is to be multiplied, among the plurality of first submatrices, and
the operation control unit instructs the pipeline operation unit to execute, after the cycle in which the operation result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, an operation of adding the matrix-vector product of the second submatrix and the second partial vector to the operation result of the matrix-vector product of the first submatrix and the first partial vector.

3. The arithmetic device according to claim 1, wherein
the vector storage unit further stores a third partial vector by which the first submatrix is to be multiplied, among a plurality of second partial vectors obtained by dividing a second vector by which the first matrix is to be multiplied, and
the operation control unit instructs the pipeline operation unit to execute, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, an operation of the matrix-vector product of the first submatrix and the third partial vector as the other matrix-vector product operation.

4. The arithmetic device according to claim 3, wherein the first vector and the second vector are column vectors included in a second matrix by which the first matrix is to be multiplied.

5. The arithmetic device according to claim 4, wherein
the vector storage unit stores a plurality of the second vectors included in the second matrix, and
the operation control unit fills each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until the operation result becomes available without delay, with an operation of the matrix-vector product of the first submatrix and the third partial vector from each of the plurality of second vectors.

6. The arithmetic device according to claim 1, wherein
the matrix storage unit further stores a third submatrix by which the first partial vector is to be multiplied, among the plurality of first submatrices, and
the operation control unit instructs the pipeline operation unit to execute, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, an operation of the matrix-vector product of the third submatrix and the first partial vector as the other matrix-vector product operation.

7. The arithmetic device according to claim 6, wherein
the matrix storage unit stores a plurality of the third submatrices, and
the operation control unit fills each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until the operation result becomes available without delay, with an operation of the matrix-vector product of each of the plurality of third submatrices and the first partial vector.

8. An arithmetic method in which:
a vector storage unit stores at least a first partial vector among a plurality of first partial vectors obtained by dividing a first vector;
a matrix storage unit stores at least a first submatrix by which the first partial vector is to be multiplied, among a plurality of first submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is to be multiplied; and
a pipeline operation unit, capable of executing by pipeline operation an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, starts, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, execution of another matrix-vector product operation using the first partial vector or the first submatrix.

9. An arithmetic program executed by an arithmetic device, wherein
the arithmetic device comprises:
a vector storage unit that stores at least a first partial vector among a plurality of first partial vectors obtained by dividing a first vector;
a matrix storage unit that stores at least a first submatrix by which the first partial vector is to be multiplied, among a plurality of first submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is to be multiplied; and
a pipeline operation unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, and
the arithmetic program causes the arithmetic device to start, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, execution of another matrix-vector product operation using the first partial vector or the first submatrix.
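
The ordering idea recited in the claims above, in which operations that reuse the same partial vector with different submatrices are issued between two dependent accumulations, can be illustrated with a small software sketch. This is not the claimed hardware and makes several assumptions: the matrix is a plain list of rows, blocks are square with a hypothetical side length `size`, and the function names `split`, `block_matvec_add`, and `interleaved_matvec` are invented for illustration. It models only the issue order and the arithmetic, not pipeline timing.

```python
def split(seq, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [seq[k:k + size] for k in range(0, len(seq), size)]


def block_matvec_add(block, part, acc):
    """Return acc + block * part for one submatrix and one partial vector."""
    return [a + sum(m * v for m, v in zip(row, part))
            for a, row in zip(acc, block)]


def interleaved_matvec(matrix, vector, size):
    """Compute matrix * vector block by block and record the issue order.

    For each partial vector, every row block is visited before moving to
    the next partial vector. Two operations that accumulate into the same
    intermediate vector are therefore separated by operations on other
    row blocks that reuse the same partial vector.
    """
    parts = split(vector, size)                      # partial vectors
    row_blocks = split(matrix, size)                 # groups of matrix rows
    acc = [[0] * len(rows) for rows in row_blocks]   # intermediate vectors
    order = []                                       # issued (row block, column block)
    for j, part in enumerate(parts):
        for i, rows in enumerate(row_blocks):
            # Submatrix at row block i, column block j.
            block = [row[j * size:(j + 1) * size] for row in rows]
            acc[i] = block_matvec_add(block, part, acc[i])
            order.append((i, j))
    result = [y for block_result in acc for y in block_result]
    return result, order


# Hypothetical 4x4 example divided into 2x2 submatrices.
example_matrix = [[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12],
                  [13, 14, 15, 16]]
example_vector = [1, 0, 2, 1]
example_result, example_order = interleaved_matvec(example_matrix, example_vector, 2)
```

For this example the issue order is (0, 0), (1, 0), (0, 1), (1, 1): the second accumulation into row block 0 is issued only after an independent operation on row block 1 that uses the same partial vector. How many independent operations are needed to cover the latency of an actual pipeline depends on the hardware and is not modeled here.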