WO2020246598A1

WO2020246598A1 - Calculation device, calculation method, and calculation program

Info

Publication number: WO2020246598A1
Application number: PCT/JP2020/022377
Authority: WO
Inventors: 淳一郎牧野; 戎崎　俊一
Original assignee: 国立研究開発法人理化学研究所
Priority date: 2019-06-07
Filing date: 2020-06-05
Publication date: 2020-12-10
Also published as: JP7241397B2; JP2020201659A

Abstract

Provided is a calculation device comprising: a vector storage unit which stores at least a first partial vector among a plurality of first partial vectors obtained by dividing a first vector; a matrix storage unit which stores, among a plurality of first submatrices obtained by dividing, in the row direction and the column direction, a first matrix to be multiplied by the first vector, at least a first submatrix to be multiplied by the first partial vector; a pipeline calculation unit which executes, through pipeline calculation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit; and a calculation control unit which, while the pipeline calculation unit executes the pipeline calculation of the matrix-vector product of the first submatrix and the first partial vector, instructs the pipeline calculation unit to execute the calculation of another matrix-vector product using the first partial vector or the first submatrix.

Description

Arithmetic logic unit, arithmetic method, and arithmetic program

The present invention relates to an arithmetic unit, an arithmetic method, and an arithmetic program.

In various applications such as numerical calculation and deep learning, matrix-matrix products (hereinafter referred to as "matrix products") and matrix-vector products account for most of the calculation amount. Therefore, arithmetic units and arithmetic methods for efficiently executing such matrix operations have been developed (see Patent Documents 1 to 3). Processors capable of performing matrix operations have also been developed.
[Prior art literature]
[Patent Document]
[Patent Document 1] International Publication No. 2018/207926
[Patent Document 2] Japanese Unexamined Patent Publication No. 2018-139045
[Patent Document 3] Japanese Unexamined Patent Publication No. 2018-197906

Problems to be solved

The matrix-vector product of an n-dimensional square matrix and an n-dimensional vector involves n^2 multiplications and about n^2 additions, for a calculation amount of about 2n^2. Therefore, when the n-dimensional square matrix is fixed, the amount of calculation of the matrix-vector product is on the order of n^2 for an input of one n-dimensional vector. Consequently, if the matrix size is increased and the matrix arithmetic unit is made larger, the ratio of the data load amount to the calculation amount can be reduced. However, when the matrix arithmetic unit is made large, the load/store capability of a register file or the like becomes relatively low, and the processing performance for operations on small matrices and for operations other than matrix operations becomes relatively low.
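The scaling argument above can be checked with a short sketch. The function name `matvec_cost` and the assumption that the matrix is already resident (so that only the n vector elements are loaded per product) are illustrative choices, not taken from the specification.

```python
def matvec_cost(n):
    """Operation and load counts for an n x n matrix times an n-element vector.

    Assumption (for illustration only): the matrix is already held in the
    arithmetic unit, so only the n vector elements are loaded per product.
    """
    multiplications = n * n
    additions = n * (n - 1)      # n - 1 additions per output element
    total_ops = multiplications + additions
    loaded_elements = n
    return total_ops, loaded_elements, loaded_elements / total_ops


# The load-to-calculation ratio falls as n grows.
for n in (4, 8, 16):
    ops, loads, ratio = matvec_cost(n)
    print(n, ops, loads, round(ratio, 4))
```

Under these assumptions the operation count grows roughly as 2n^2 while the loaded data grows only as n, which is the trade-off the paragraph describes.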

General disclosure

In the first aspect of the present invention, an arithmetic unit is provided. The arithmetic unit may include a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. The arithmetic unit may include a matrix storage unit that stores at least a first submatrix to be multiplied by the first partial vector among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix to be multiplied by the first vector. The arithmetic unit may include a pipeline calculation unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit. The arithmetic unit may include an arithmetic control unit that, while the pipeline calculation unit is performing the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, instructs the pipeline calculation unit to execute an operation of another matrix-vector product using the first partial vector or the first submatrix.

The vector storage unit may further store a second partial vector among the first plurality of partial vectors. The matrix storage unit may further store a second submatrix to be multiplied by the second partial vector among the first plurality of submatrices. The arithmetic control unit may instruct the pipeline calculation unit to execute, after the cycle in which the operation result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, an operation of adding the matrix-vector product of the second submatrix and the second partial vector to the operation result of the matrix-vector product of the first submatrix and the first partial vector.

The vector storage unit may further store a third partial vector to be multiplied by the first submatrix among a second plurality of partial vectors obtained by dividing a second vector to be multiplied by the first matrix. During the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the arithmetic control unit may instruct the pipeline calculation unit to execute, as the operation of the other matrix-vector product, the operation of the matrix-vector product of the first submatrix and the third partial vector.

The first vector and the second vector may be column vectors included in the second matrix to be multiplied by the first matrix.

The vector storage unit may store a plurality of second vectors included in the second matrix. The arithmetic control unit may fill each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until just before the operation result becomes available without delay, with an operation of the matrix-vector product of the first submatrix and the third partial vector from each of the plurality of second vectors.

The matrix storage unit may further store a third submatrix to be multiplied by the first partial vector among the first plurality of submatrices. During the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the arithmetic control unit may instruct the pipeline calculation unit to execute, as the operation of the other matrix-vector product, the operation of the matrix-vector product of the third submatrix and the first partial vector.

The matrix storage unit may store a plurality of third submatrices. The arithmetic control unit may fill each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until just before the operation result becomes available without delay, with an operation of the matrix-vector product of each of the plurality of third submatrices and the first partial vector.

In the second aspect of the present invention, a calculation method is provided. The calculation method may include a vector storage unit storing at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. The calculation method may include a matrix storage unit storing at least a first submatrix to be multiplied by the first partial vector among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix to be multiplied by the first vector. The calculation method may include a pipeline calculation unit, which is capable of executing by pipeline operation an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, starting, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, execution of an operation of another matrix-vector product using the first partial vector or the first submatrix.

In the third aspect of the present invention, an arithmetic program executed by an arithmetic unit is provided. The arithmetic unit may include a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. The arithmetic unit may include a matrix storage unit that stores at least a first submatrix to be multiplied by the first partial vector among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix to be multiplied by the first vector. The arithmetic unit may include a pipeline calculation unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit. The arithmetic program may cause the arithmetic unit to start, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, execution of an operation of another matrix-vector product using the first partial vector or the first submatrix.

The outline of the above invention does not list all the necessary features of the present invention. Sub-combinations of these feature groups can also be inventions.

FIG. 1 shows an example of the matrix operation according to this embodiment.
FIG. 2 shows an example of a calculation formula obtained by decomposing the matrix operation according to the present embodiment into matrix-vector products of a submatrix and a partial vector.
FIG. 3 shows the configuration of the arithmetic unit 300 according to this embodiment.
FIG. 4 shows a first example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
FIG. 5 shows a second example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
FIG. 6 shows a third example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
FIG. 7 shows a fourth example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
FIG. 8 shows an example of a computer 2200 in which a plurality of aspects of the present invention may be embodied in whole or in part.

Hereinafter, the present invention will be described through embodiments of the invention, but the following embodiments do not limit the invention as claimed. Also, not all combinations of features described in the embodiments are essential to the solution provided by the invention.

FIG. 1 shows an example of a matrix operation according to this embodiment. This figure shows a matrix operation C = A × B that calculates the matrix product of the matrix A and the matrix B and substitutes it into the matrix C. The matrices A, B, and C are square matrices of 8 rows and 8 columns.

aij (i = 1, 2, ..., 8; j = 1, 2, ..., 8) is a component (also referred to as an "element") of the matrix A. bij (i = 1, 2, ..., 8; j = 1, 2, ..., 8) is a component of the matrix B. cij (i = 1, 2, ..., 8; j = 1, 2, ..., 8) is a component of the matrix C. For the range j ≧ 3, the components bij and cij are omitted from the illustration.

Further, the column vector of column j of the matrix B, that is, the vector whose components are the components bij (i = 1, 2, ..., 8) of column j of the matrix B, is denoted as the vector vbj, and the column vector of column j of the matrix C is denoted as the vector vcj. That is, vbj = (b1j, b2j, ..., b8j)^T and vcj = (c1j, c2j, ..., c8j)^T. The vector vcj can then be calculated as the matrix-vector product vcj = A × vbj of the matrix A and the vector vbj.

Here, for example, when using an arithmetic unit capable of executing, as a unit operation, the matrix-vector product of a 4-by-4 matrix and a 4-element vector, the matrix operation C = A × B is divided into units that the arithmetic unit can calculate at once. In this figure, the submatrices obtained by dividing the matrix A into two in each of the row direction and the column direction are shown as A11, A21, A12, and A22. The submatrix Amn (m = 1, 2; n = 1, 2) has as its components the components of the matrix A that lie in the m-th row range of the division in the row direction and the n-th column range of the division in the column direction. Further, the partial vectors obtained by dividing the vector vbj into two in the row direction are shown as vb1j and vb2j. The partial vector vbmj (m = 1, 2) has as its components the components of the vector vbj in the m-th row range of the division in the row direction. Similarly, the partial vectors obtained by dividing the vector vcj into two in the row direction are denoted as vc1j and vc2j, and the partial vector vcmj (m = 1, 2) has as its components the components of the vector vcj in the m-th row range of the division in the row direction.

In this figure, a matrix product is shown as an example of the matrix operation, and the matrix-vector product is described as forming part of the matrix product. A matrix-vector product that is not part of a matrix product operation is handled in the same way as the matrix-vector product vc1 = A × vb1 for the first column of the matrix B and the matrix C. Also, this embodiment illustrates the case where the matrices A, B, and C have a power-of-2 number of elements in the row and column directions and the matrix A is divided into a power-of-2 number of parts in the row and column directions. Alternatively, the matrices A, B, and C may have a number of elements other than a power of 2 in at least one of the row and column directions, and the matrix A may be divided into a number of parts other than a power of 2 in at least one of the row and column directions (e.g., 3x3, 5x5, 9x9, 3x5, 5x9).

FIG. 2 shows an example of a calculation formula obtained by decomposing the matrix operation according to the present embodiment into matrix-vector products of a submatrix and a partial vector. The matrix-vector product vcj = A × vbj of the matrix A and the vector vbj can be divided into vc1j = (A11 A12) × vbj = A11 × vb1j + A12 × vb2j, which calculates the partial vector vc1j, and vc2j = (A21 A22) × vbj = A21 × vb1j + A22 × vb2j, which calculates the partial vector vc2j. That is, when j = 1, vc11 = A11 × vb11 + A12 × vb21 and vc21 = A21 × vb11 + A22 × vb21. Further, when j = 2, vc12 = A11 × vb12 + A12 × vb22 and vc22 = A21 × vb12 + A22 × vb22. The same applies to j = 3, ..., 8.
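The decomposition in FIG. 2 can be illustrated with a minimal sketch in plain Python. The helper names `matvec`, `block`, and `blocked_matvec` and the sample values are hypothetical; the sketch only checks that summing the submatrix products reproduces the full product vcj = A × vbj.

```python
def matvec(M, v):
    """Plain matrix-vector product M x v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def block(A, m, n, size=4):
    """Submatrix Amn of A (block indices m and n are 1-based, as in the text)."""
    r0, c0 = (m - 1) * size, (n - 1) * size
    return [row[c0:c0 + size] for row in A[r0:r0 + size]]


def blocked_matvec(A, vb, size=4):
    """Compute A x vb as a sum of submatrix x partial-vector products."""
    d = len(A) // size
    parts = [vb[k * size:(k + 1) * size] for k in range(d)]  # vb1j, vb2j, ...
    vc = []
    for m in range(1, d + 1):
        acc = [0] * size                                      # intermediate vector
        for n in range(1, d + 1):
            prod = matvec(block(A, m, n, size), parts[n - 1])
            acc = [a + p for a, p in zip(acc, prod)]
        vc.extend(acc)                                        # vc1j, then vc2j, ...
    return vc


# Hypothetical 8 x 8 integer matrix and 8-element vector.
A = [[(3 * i + 5 * j) % 7 for j in range(8)] for i in range(8)]
vb = [j - 3 for j in range(8)]
print(blocked_matvec(A, vb) == matvec(A, vb))
```

Integer data is used so that the blocked and unblocked results can be compared exactly.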

In this way, when the matrix A is divided into d parts in each of the row direction and the column direction, and the vector vb is divided into d parts, the matrix-vector product of the matrix A and the vector vb comprises d × d matrix-vector products of a submatrix and a partial vector. When the arithmetic unit has only a register capable of storing a single submatrix, the arithmetic unit performs the matrix operation shown in FIG. 2 while sequentially loading the submatrices from memory into the register, which degrades the processing performance.

FIG. 3 shows the configuration of the arithmetic unit 300 according to the present embodiment. The arithmetic unit 300 can execute, as a unit operation by pipeline operation, the matrix-vector product of a matrix of up to the number of rows and columns defined in its specifications and a vector of up to the number of rows defined in its specifications. The arithmetic unit 300 calculates a matrix-vector product of a matrix and a vector larger than the size that can be processed by one unit operation by dividing it into a plurality of matrix-vector products of a submatrix and a partial vector, each of which can be processed by one unit operation.

The arithmetic unit 300 includes a vector storage unit 310, a matrix storage unit 320, a pipeline calculation unit 330, a result storage unit 340, an arithmetic control unit 350, a main memory 360, and a memory control unit 370. The vector storage unit 310 stores at least the first partial vector among the first plurality of partial vectors obtained by dividing the first vector. In the present embodiment, the vector storage unit 310 is a register as an example. Instead of this, the vector storage unit 310 may be another storage device such as a cache memory that can supply the partial vector to the pipeline calculation unit 330 in a pipeline manner.

Here, the first vector is the target vector to be multiplied by the first matrix, at least one submatrix of which is stored in the matrix storage unit 320. The first vector is larger than the size that the arithmetic unit 300 can process in one unit operation. The first plurality of partial vectors are obtained by dividing the first vector into sizes that can be processed in one unit operation. In the matrix operation of FIG. 1, the first vector corresponds to any of the vectors vbj (for example, vb1). The first plurality of partial vectors correspond to the partial vectors vbij (i = 1, 2) obtained by dividing the first vector vbj. If the first vector is larger, it may be divided into three or more partial vectors.

Further, the vector storage unit 310 may have a sufficient storage area for further storing the second partial vector and the other partial vectors among the first plurality of partial vectors. For example, in the matrix operation of FIG. 1, the vector storage unit 310 may store the partial vector vb1j and the partial vector vb2j.

The matrix storage unit 320 stores at least the first submatrix to be multiplied by the first partial vector among the first plurality of submatrices obtained by dividing, in the row direction and the column direction, the first matrix to be multiplied by the first vector. In the present embodiment, the matrix storage unit 320 is a register as an example. Alternatively, the matrix storage unit 320 may be another storage device, such as a cache memory, that can supply submatrices to the pipeline calculation unit 330 in a pipelined manner.

Here, the first matrix is the target matrix to be multiplied by the first vector, at least one partial vector of which is stored in the vector storage unit 310. The first matrix is larger than the size that the arithmetic unit 300 can process in one unit operation. The first plurality of submatrices are obtained by dividing the first matrix into sizes that the arithmetic unit 300 can process in one unit operation. In the matrix operation of FIG. 1, the first matrix corresponds to the matrix A. The first plurality of submatrices correspond to the submatrices Aij (i = 1, 2; j = 1, 2) obtained by dividing the first matrix A in the row direction and the column direction. If the first matrix is larger, it may be divided into three or more parts in each of the row and column directions.

Further, the matrix storage unit 320 may have a storage area sufficient to further store, among the first plurality of submatrices, the second submatrix to be multiplied by the second partial vector and other submatrices. For example, in the matrix operation of FIG. 1, the matrix storage unit 320 may store the submatrix A11 to be multiplied by the partial vector vb1j and the submatrix A12 to be multiplied by the partial vector vb2j. Here, the first partial vector and the second partial vector are located in different row ranges of the first vector. Therefore, the first submatrix and the second submatrix are located in different column ranges of the target matrix. The first submatrix and the second submatrix may be located in the same row range of the target matrix.

The pipeline calculation unit 330 is connected to the vector storage unit 310 and the matrix storage unit 320. It receives from the vector storage unit 310 the partial vector to be calculated that is stored in the vector storage unit 310, and receives from the matrix storage unit 320 the submatrix to be calculated that is stored in the matrix storage unit 320. The pipeline calculation unit 330 can execute, by pipeline calculation, an operation of adding an intermediate vector to the matrix-vector product of the submatrix and the partial vector to be calculated. In the present embodiment, the pipeline calculation unit 330 can execute, as a unit operation, the operation of calculating the matrix-vector product of a 4-row, 4-column submatrix and a 4-row partial vector and adding a 4-row intermediate vector to that product to obtain the partial vector of the calculation result (also referred to as the “result vector”).

Here, being executable as a unit operation means that, in response to a request such as an external instruction or the execution of an instruction, the pipeline calculation unit 330 executes the operation of adding the intermediate vector to the matrix-vector product of the submatrix and the partial vector to be calculated as a single combined operation and outputs the result. The pipeline calculation unit 330 may have a large number of calculation units so that all the basic operations included in this operation (for example, multiplications and additions of values) can be performed by separate calculation units, or some of the basic operations may be performed by the same calculation unit.

Further, the pipeline calculation unit 330 performing a pipeline calculation means that, when the pipeline calculation unit 330 outputs a result after processing in a plurality of stages following the start of an operation, the stages can operate in parallel. That is, the pipeline calculation unit 330 can start other operations one after another in each cycle from the start of a certain operation to the output of its result, provided that nothing hinders their execution.

For example, the pipeline calculation unit 330 may input a submatrix and a partial vector in the first cycle, multiply the corresponding elements of the submatrix and the partial vector in the second cycle, sum, in the third cycle, the products calculated in the second cycle for each element of the result vector, and output the partial vector of the calculation result in the fourth cycle. The pipeline calculation unit 330 may have a pipeline structure with any number of stages as needed.
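The four-stage example above can be visualized with a small timing sketch. The stage names and the function `stage_table` are hypothetical labels for illustration, and cycles are numbered from 0, so the "first cycle" of the text is cycle 0 here. The sketch assumes a stall-free pipeline in which one new operation is issued per cycle.

```python
# Hypothetical names for the four stages described in the text.
STAGES = ["input", "multiply", "sum", "output"]


def stage_table(issue_cycles):
    """Return {cycle: {operation index: stage name}} for a stall-free pipeline."""
    table = {}
    for op, start in enumerate(issue_cycles):
        for offset, stage in enumerate(STAGES):
            table.setdefault(start + offset, {})[op] = stage
    return table


# Three independent operations issued in consecutive cycles.
table = stage_table([0, 1, 2])
for cycle in sorted(table):
    print(cycle, table[cycle])
```

In cycle 2 of this sketch, three different operations occupy three different stages at once, which is the parallel operation of stages that the text refers to.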

The result storage unit 340 is connected to the pipeline calculation unit 330. The result storage unit 340 receives and stores the result vector output by the pipeline calculation unit 330. The result vector is, for example, vc11 and vc21 in FIG. In the present embodiment, the result storage unit 340 is a register as an example. Instead of this, the result storage unit 340 may be another storage device such as a cache memory that can store the partial vector from the pipeline calculation unit 330 in a pipeline manner. The vector storage unit 310, the matrix storage unit 320, and the result storage unit 340 may be implemented as the same storage device.

The arithmetic control unit 350 is connected to the vector storage unit 310, the matrix storage unit 320, the pipeline calculation unit 330, and the result storage unit 340. In response to a matrix operation execution request, such as receiving an instruction from outside the arithmetic unit 300 or decoding a matrix operation instruction during program execution in the arithmetic unit 300, the arithmetic control unit 350 controls the vector storage unit 310, the matrix storage unit 320, the pipeline calculation unit 330, and the result storage unit 340 so as to execute the requested matrix operation.

The main memory 360 stores the matrices to be calculated and the calculation results. The memory control unit 370 is connected between the main memory 360 and each of the vector storage unit 310, the matrix storage unit 320, and the result storage unit 340. In response to an external instruction or a memory access instruction during program execution in the arithmetic unit 300, the memory control unit 370 transfers data between the main memory 360 and the vector storage unit 310, the matrix storage unit 320, and the result storage unit 340.

For example, in response to a request for a vector load from the main memory 360 to the vector storage unit 310, the memory control unit 370 reads from the main memory 360 the partial vector specified by the vector load, among the partial vectors stored in the main memory 360, and stores it in the vector storage unit 310. Further, in response to a request for a matrix load from the main memory 360 to the matrix storage unit 320, the memory control unit 370 reads from the main memory 360 the submatrix specified by the matrix load, among the submatrices stored in the main memory 360, and stores it in the matrix storage unit 320. Further, in response to a request for a matrix or vector store from the result storage unit 340 to the main memory 360, the memory control unit 370 reads the matrix or vector of the calculation result stored in the result storage unit 340 and stores it in the main memory 360. Depending on the design of the arithmetic unit 300, instead of providing the main memory 360 separately from the vector storage unit 310 and the matrix storage unit 320, a relatively large memory that functions as the vector storage unit 310 and the matrix storage unit 320 may be provided, and partial vectors and submatrices may be supplied in a pipelined manner directly from that memory to the pipeline calculation unit 330.

In the configuration shown above, the pipeline calculation unit 330 executes the calculation of the matrix-vector product by pipeline processing. For example, when performing the calculation vc11 = A11 × vb11 + A12 × vb21 shown in FIG. 2, the pipeline calculation unit 330 takes multiple cycles from starting the calculation of the first submatrix A11 and the first partial vector vb11 until the calculation result is obtained. Therefore, even if a second operation, which adds the matrix-vector product of the second submatrix A12 and the second partial vector vb21 to the result of a first operation, is input in the cycle following the cycle in which the first operation, which calculates the matrix-vector product of the first submatrix A11 and the first partial vector vb11, is started, execution of the second operation is hindered (a pipeline hazard), and processing of the second operation must wait until the operation result of the first operation becomes available.

Depending on the pipeline design, the waiting time of the second operation can be reduced to some extent by supplying the result of the first operation to the second operation without waiting for the result to be written to a register (bypassing or forwarding). However, since there is a dependency between the first operation and the second operation, it is difficult to completely eliminate the idle slots that the pipeline hazard creates in the pipeline of the pipeline calculation unit 330.

Therefore, while the pipeline calculation unit 330 is performing the pipeline calculation of the matrix-vector product of the first submatrix and the first partial vector (for example, A11 × vb11), the arithmetic control unit 350 instructs the pipeline calculation unit 330 to execute an operation of another matrix-vector product using the first partial vector or the first submatrix. Here, the "other matrix-vector product" is an operation that includes a matrix-vector product and does not use the operation result of the matrix-vector product of the first submatrix and the first partial vector; for example, it may be an operation that adds to the matrix-vector product an operation result other than that of the matrix-vector product of the first submatrix and the first partial vector. In this way, during the period in which the pipeline calculation unit 330 would otherwise wait for the result of the first operation before starting the second operation, the arithmetic control unit 350 inputs to the pipeline calculation unit 330 one or more other matrix-vector products that do not depend on the result of the first operation, thereby improving the utilization efficiency of the pipeline calculation unit 330.

Further, the arithmetic control unit 350 may instruct the pipeline calculation unit 330 to execute, after the cycle in which the operation result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, the operation of adding the matrix-vector product of the second submatrix and the second partial vector to the operation result of the matrix-vector product of the first submatrix and the first partial vector. As a result, the arithmetic control unit 350 can prevent a pipeline hazard from occurring in the second operation and can insert other matrix-vector product operations between the first operation and the second operation.
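The effect of inserting independent operations can be sketched with a simple in-order issue model. This is an illustrative model, not the control logic of the specification: the function `issue_cycles`, the operation names, and the latency of 2 cycles (the number of cycles after issue at which a result becomes usable by a dependent operation) are all assumptions.

```python
def issue_cycles(ops, latency):
    """Issue ops in order, at most one per cycle.

    ops: list of (name, dependency name or None).
    An op whose dependency was issued at cycle t cannot issue before t + latency.
    Returns {name: issue cycle}.
    """
    issued = {}
    cycle = 0
    for name, dep in ops:
        if dep is not None:
            cycle = max(cycle, issued[dep] + latency)  # stall until result is usable
        issued[name] = cycle
        cycle += 1
    return issued


LATENCY = 2  # assumed value for illustration

# Dependent operations issued back to back: the pipeline stalls.
naive = issue_cycles(
    [("A11*vb11", None), ("A12*vb21+tmp1", "A11*vb11"),
     ("A11*vb12", None), ("A12*vb22+tmp2", "A11*vb12")],
    LATENCY,
)

# Independent product A11*vb12 fills the slot after A11*vb11.
interleaved = issue_cycles(
    [("A11*vb11", None), ("A11*vb12", None),
     ("A12*vb21+tmp1", "A11*vb11"), ("A12*vb22+tmp2", "A11*vb12")],
    LATENCY,
)

print(max(naive.values()), max(interleaved.values()))
```

Under these assumptions the interleaved order issues all four operations in consecutive cycles, while the naive order leaves idle cycles before each dependent operation.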

FIG. 4 shows a first example of pipeline processing by the arithmetic unit 300 according to the present embodiment. In the operation shown as cycle 0, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb11, which is an example of the first partial vector, instructs the matrix storage unit 320 to read A11, which is an example of the first submatrix, and instructs the pipeline calculation unit 330 to execute an operation of calculating the matrix-vector product of the first submatrix A11 and the first partial vector vb11 and storing it, as an intermediate vector partway through the calculation, in the intermediate register (temporary register) vctmp1 of the pipeline calculation unit 330.

By the start of execution of cycle 1, the vector storage unit 310 further stores, in addition to vb11, which is an example of the first partial vector, and vb21, which is an example of the second partial vector, a third partial vector (vb12 as an example) to be multiplied by the first submatrix A11 among the second plurality of partial vectors vbi2 obtained by dividing a second vector (vb2 as an example) to be multiplied by the first matrix A. In this example, the first vector and the second vector are column vectors included in the second matrix B to be multiplied by the first matrix A; for example, the first vector is vb1 and the second vector is vb2. The third partial vector is vb12, which is the one of the second plurality of partial vectors vbi2 obtained by dividing the second vector vb2 that is to be multiplied by the first submatrix A11. Alternatively, the first vector and the second vector may be separate vectors that are each to be multiplied by the matrix A.

In the operation shown as cycle 1, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the arithmetic control unit 350 instructs the vector storage unit 310 to read the third partial vector vb12, instructs the matrix storage unit 320 to read the first submatrix A11, and instructs the pipeline calculation unit 330 to execute the operation of the matrix-vector product of the first submatrix and the third partial vector as another matrix-vector product operation that does not cause a pipeline hazard. In response, the pipeline calculation unit 330 executes an operation of storing the matrix-vector product of the first submatrix and the third partial vector, as an intermediate vector partway through the calculation, in the intermediate register vctmp2 of the pipeline calculation unit 330. This operation corresponds to the operation of the first matrix-vector product in the third row of FIG. 2, and the matrix-vector products of cycles 0 and 1 are reflected in the different result vectors vc11 and vc12. Therefore, since there is no dependency between these operations, the pipeline calculation unit 330 can execute them without causing a pipeline hazard.

In the operation shown as cycle 2, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb21, which is an example of the second partial vector, instructs the matrix storage unit 320 to read A12, which is an example of the second submatrix, instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the second submatrix A12 and the second partial vector vb21 and add to it the result vctmp1 of the operation of cycle 0, and instructs the result storage unit 340 to store the partial vector vc11 obtained as the result of the operation. The operation of cycle 2 depends on the operation of cycle 0, but by inserting the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2, the arithmetic control unit 350 can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.

By the start of execution of cycle 3, the vector storage unit 310 may further store a fourth partial vector to be multiplied by the second submatrix A12, from among the second plurality of partial vectors. In the operation shown as cycle 3, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb22, which is an example of the fourth partial vector, instructs the matrix storage unit 320 to read A12, which is an example of the second submatrix, instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the second submatrix A12 and the fourth partial vector vb22 and add to it the result vctmp2 of the operation of cycle 1, and instructs the main memory 360 to store the partial vector vc12 obtained as the result of the operation. The operation of cycle 3 depends on the operation of cycle 1, but by inserting the operation of cycle 2, which does not depend on the operation of cycle 1, between the operations of cycle 1 and cycle 3, the arithmetic control unit 350 can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.

In the example of this figure, the partial vectors (vc11, vc12) of the first row range (first to fourth rows) of the plurality of column vectors (vc1, vc2) of the matrix C are calculated in cycles 0 to 3, and the partial vectors (vc21, vc22) of the second row range (fifth to eighth rows) of the same column vectors are calculated in cycles 4 to 7. The operations of cycles 4 to 7 are the same as those of cycles 0 to 3 except that the submatrices A21 and A22 are used instead of the submatrices A11 and A12 and the partial vectors vc21 and vc22 are calculated instead of the partial vectors vc11 and vc12.
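The eight cycles of the first example can be traced with a short Python sketch. It is only an illustration under stated assumptions: the helper names (matvec_add, run, block) and the test data are hypothetical, the 8 x 8 matrix with 4 x 4 submatrices follows the row ranges mentioned above, and pipeline timing is not modeled, only the order of operations and the use of the intermediate registers vctmp1 and vctmp2.

```python
# Sketch of the first example's schedule: C = A x B, with A split into
# 2 x 2 submatrices and B consisting of two column vectors vb1 and vb2.
# Helper names and test data are hypothetical, not taken from the source.

def matvec_add(sub, vec, acc):
    """Return sub * vec + acc using plain Python lists."""
    return [sum(a * v for a, v in zip(row, vec)) + c for row, c in zip(sub, acc)]

# (submatrix, partial vector, intermediate vector to add or None, destination)
SCHEDULE_FIG4 = [
    ("A11", "vb11", None,     "vctmp1"),  # cycle 0
    ("A11", "vb12", None,     "vctmp2"),  # cycle 1: independent of cycle 0
    ("A12", "vb21", "vctmp1", "vc11"),    # cycle 2: uses the result of cycle 0
    ("A12", "vb22", "vctmp2", "vc12"),    # cycle 3: uses the result of cycle 1
    ("A21", "vb11", None,     "vctmp1"),  # cycle 4
    ("A21", "vb12", None,     "vctmp2"),  # cycle 5
    ("A22", "vb21", "vctmp1", "vc21"),    # cycle 6
    ("A22", "vb22", "vctmp2", "vc22"),    # cycle 7
]

def run(schedule, operands):
    """Execute the operations in order; registers and results share one dict."""
    store = dict(operands)
    for sub, vec, src, dst in schedule:
        acc = store[src] if src else [0] * len(store[sub])
        store[dst] = matvec_add(store[sub], store[vec], acc)
    return store

def block(matrix, r, c, h, w):
    """Cut an h x w submatrix whose top-left element is at (r, c)."""
    return [row[c:c + w] for row in matrix[r:r + h]]

N, H = 8, 4  # matrix size and submatrix size
A = [[(i * N + j) % 7 - 3 for j in range(N)] for i in range(N)]
B = [[(i + 2 * k) % 5 - 2 for k in range(2)] for i in range(N)]

operands = {}
for i in range(2):
    for j in range(2):
        operands[f"A{i + 1}{j + 1}"] = block(A, i * H, j * H, H, H)
    for k in range(2):
        # vb<i><k>: i-th row range of the k-th column vector of B
        operands[f"vb{i + 1}{k + 1}"] = [B[i * H + r][k] for r in range(H)]

result = run(SCHEDULE_FIG4, operands)
C = [[result[f"vc{r // H + 1}{k + 1}"][r % H] for k in range(2)] for r in range(N)]
expected = [[sum(A[r][m] * B[m][k] for m in range(N)) for k in range(2)]
            for r in range(N)]
assert C == expected
```

The final assertion checks that the partial vectors vc11, vc12, vc21, and vc22 assembled from the schedule equal the directly computed product.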

In this example, the operation control unit 350 inserts another matrix-vector product operation that uses the first submatrix, namely the matrix-vector product of the first submatrix and the third partial vector, between the first operation, which is the matrix-vector product of the first submatrix and the first partial vector, and the second operation, which uses the result of the first operation. As a result, the operation control unit 350 can make use of the one free cycle required between the first operation and the second operation.

When a plurality of free cycles occur between the first operation and the second operation, the operation control unit 350 may insert the matrix-vector products of the first submatrix and each of a plurality of third partial vectors between the first operation and the second operation. For example, the vector storage unit 310 further stores a plurality of second vectors vb2, vb3, ... included in the second matrix B. The arithmetic control unit 350 fills each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix A11 and the first partial vector vb11 until just before the cycle in which its result becomes available without delay, with the matrix-vector products A11 × vb12, A11 × vb13, ... of the first submatrix A11 and the third partial vectors vb12, vb13, ... taken from the respective second vectors vb2, vb3, .... The first vector and the plurality of second vectors may be arranged in the column order of the second matrix or in the reverse of that order, or they may be column vectors of arbitrary columns that are not arranged in column order.
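The effect of filling several free cycles can be illustrated with a small Python sketch. It assumes a simplified model in which each operation accumulates into one register per result partial vector; the function names and the register label acc are hypothetical. The sketch only measures how many cycles separate two operations that depend on each other when the vector index is the innermost loop.

```python
# Sketch: interleave the products of one submatrix with several column
# vectors so that dependent operations are spaced apart.
# Names (interleaved_schedule, min_dependency_distance, acc) are hypothetical.

def interleaved_schedule(row_blocks, col_blocks, num_vectors):
    """List (submatrix, partial vector, accumulator) per cycle."""
    ops = []
    for i in range(1, row_blocks + 1):
        for j in range(1, col_blocks + 1):
            for k in range(1, num_vectors + 1):  # vector index innermost
                ops.append((f"A{i}{j}", f"vb{j}{k}", f"acc{i}{k}"))
    return ops

def min_dependency_distance(ops):
    """Smallest cycle gap between two operations using the same accumulator."""
    last_seen, best = {}, None
    for cycle, (_, _, acc) in enumerate(ops):
        if acc in last_seen:
            gap = cycle - last_seen[acc]
            best = gap if best is None else min(best, gap)
        last_seen[acc] = cycle
    return best

for n in (1, 2, 4):
    print(n, min_dependency_distance(interleaved_schedule(2, 2, n)))
```

With n column vectors, the operation that consumes an intermediate vector is issued n cycles after the operation that produced it, so n can be chosen to match the number of cycles the pipeline needs before a result becomes available.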

FIG. 5 shows a second example of pipeline processing by the arithmetic unit 300 according to the present embodiment. When the pipeline calculation unit 330 has more intermediate registers, or when a calculation result becomes available after being stored once in the main memory 360, the calculation control unit 350 may perform control so that the operations of cycles 4 to 5 in FIG. 4 are performed before the operations of cycles 2 to 3. In this case, while the pipeline operation of the first operation, which calculates the matrix-vector product of the first submatrix A11 and the first partial vector vb11, is being executed, the operation control unit 350 causes the pipeline calculation unit 330 to execute the operation of cycle 1 in FIG. 5, which is another matrix-vector product operation using the first submatrix, and the operation of cycle 2 in FIG. 5, which is another matrix-vector product operation using the first partial vector. Further, the calculation control unit 350 may cause the operation of cycle 3, which is the matrix-vector product of the second submatrix A21 used in the operation of cycle 2 and the second partial vector vb12 used in the operation of cycle 1, to be executed between the first operation and the second operation. As a result, the calculation control unit 350 can fill more of the free cycles between the first operation and the second operation. The operations of cycles 0 to 3 may be executed in any order, and the execution order of the operations of cycles 4 to 7 may be determined according to the execution order of the corresponding operations of cycles 0 to 3.
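The difference between the first and second examples can be expressed as the distance, in cycles, between the operation that writes an intermediate vector and the operation that reads it. The following Python sketch makes that comparison. The register names vctmp3 and vctmp4 and the order chosen for cycles 4 to 7 of the second example are assumptions for illustration (the text only states that more intermediate registers are available and that the later cycles follow the order of the earlier ones), and the function names are hypothetical.

```python
# Sketch: compare producer-to-consumer distances for two operation orders.
# Each entry is (submatrix, partial vector, register read or None, destination).

FIRST_EXAMPLE = [
    ("A11", "vb11", None, "vctmp1"), ("A11", "vb12", None, "vctmp2"),
    ("A12", "vb21", "vctmp1", "vc11"), ("A12", "vb22", "vctmp2", "vc12"),
    ("A21", "vb11", None, "vctmp1"), ("A21", "vb12", None, "vctmp2"),
    ("A22", "vb21", "vctmp1", "vc21"), ("A22", "vb22", "vctmp2", "vc22"),
]

# Assumed order for the second example, using four intermediate registers.
SECOND_EXAMPLE = [
    ("A11", "vb11", None, "vctmp1"), ("A11", "vb12", None, "vctmp2"),
    ("A21", "vb11", None, "vctmp3"), ("A21", "vb12", None, "vctmp4"),
    ("A12", "vb21", "vctmp1", "vc11"), ("A12", "vb22", "vctmp2", "vc12"),
    ("A22", "vb21", "vctmp3", "vc21"), ("A22", "vb22", "vctmp4", "vc22"),
]

def min_gap(schedule):
    """Smallest number of cycles between writing a register and reading it."""
    last_write, gaps = {}, []
    for cycle, (_, _, src, dst) in enumerate(schedule):
        if src is not None:
            gaps.append(cycle - last_write[src])
        last_write[dst] = cycle
    return min(gaps)

def is_hazard_free(schedule, latency):
    """True if every result is read no sooner than `latency` cycles after issue."""
    return min_gap(schedule) >= latency

print(min_gap(FIRST_EXAMPLE), min_gap(SECOND_EXAMPLE))
```

Under this model the first order tolerates a latency of two cycles and the second a latency of four, at the cost of two additional intermediate registers.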

FIG. 6 shows a third example of pipeline processing by the arithmetic unit 300 according to the present embodiment. In the operation shown as cycle 0, the calculation control unit 350 performs the same control as in cycle 0 of FIG. 4.

By the start of execution of cycle 1, the matrix storage unit 320 further stores, in addition to A11, which is an example of the first submatrix, and A12, which is an example of the second submatrix, a third submatrix (A21 as an example) to be multiplied by the first partial vector vb11, from among the first plurality of submatrices Aij obtained by dividing the first matrix A in the row direction and the column direction. Alternatively, the third submatrix may be a submatrix included in a matrix other than the first matrix A.

In the operation shown as cycle 1, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the arithmetic control unit 350 instructs the vector storage unit 310 to read the first partial vector vb11, instructs the matrix storage unit 320 to read the third submatrix A21, and instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the third submatrix A21 and the first partial vector vb11 as another matrix-vector product operation that does not cause a pipeline hazard. In response, the pipeline calculation unit 330 stores the matrix-vector product of the third submatrix A21 and the first partial vector vb11 in the intermediate register vctmp2 of the pipeline calculation unit 330 as an intermediate vector in the middle of the calculation. This operation corresponds to the first matrix-vector product in the second row of FIG. 2, and the matrix-vector products of cycles 0 and 1 are reflected in different result vectors vc11 and vc21. Because there is no dependency between these operations, the pipeline calculation unit 330 can execute them without causing a pipeline hazard.

In the operation shown as cycle 2, the calculation control unit 350 performs the same control as in cycle 2 of FIG. 4. The operation of cycle 2 depends on the operation of cycle 0, but by inserting the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2, the operation control unit 350 can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.

By the start of execution of cycle 3, the matrix storage unit 320 may further store a fourth submatrix to be multiplied by the second partial vector vb21, from among the first plurality of submatrices Aij. In the operation shown as cycle 3, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb21, which is an example of the second partial vector, instructs the matrix storage unit 320 to read A22, which is an example of the fourth submatrix, instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the fourth submatrix A22 and the second partial vector vb21 and add to it the result vctmp2 of the operation of cycle 1, and instructs the main memory 360 to store the partial vector vc21 obtained as the result of the operation. The operation of cycle 3 depends on the operation of cycle 1, but by inserting the operation of cycle 2, which does not depend on the operation of cycle 1, between the operations of cycle 1 and cycle 3, the operation control unit 350 can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.

In the example of this figure, the two partial vectors vc11 and vc21 included in one column vector vc1 of the matrix C are calculated in cycles 0 to 3, and the two partial vectors vc12 and vc22 included in another column vector vc2 of the matrix C are calculated in cycles 4 to 7. The operations of cycles 4 to 7 are the same as those of cycles 0 to 3 except that the partial vectors vb12 and vb22 are used instead of the partial vectors vb11 and vb21 and the partial vectors vc12 and vc22 are calculated instead of the partial vectors vc11 and vc21, and their description is therefore omitted.
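One way to check that the order of the third example still produces every result partial vector is to execute the schedule symbolically, recording which products have been accumulated into each register rather than computing numbers. The Python sketch below does this; the schedule is transcribed from the description of cycles 0 to 7 above, and the function name symbolic_run is hypothetical.

```python
# Sketch: symbolic execution of the third example's schedule.
# Each register holds the set of (submatrix, partial vector) products added so far.

SCHEDULE_THIRD = [
    ("A11", "vb11", None,     "vctmp1"),  # cycle 0
    ("A21", "vb11", None,     "vctmp2"),  # cycle 1: reuses vb11, new submatrix
    ("A12", "vb21", "vctmp1", "vc11"),    # cycle 2
    ("A22", "vb21", "vctmp2", "vc21"),    # cycle 3
    ("A11", "vb12", None,     "vctmp1"),  # cycle 4
    ("A21", "vb12", None,     "vctmp2"),  # cycle 5
    ("A12", "vb22", "vctmp1", "vc12"),    # cycle 6
    ("A22", "vb22", "vctmp2", "vc22"),    # cycle 7
]

def symbolic_run(schedule):
    """Return a dict mapping each register to the products it contains."""
    regs = {}
    for sub, vec, src, dst in schedule:
        terms = set(regs[src]) if src else set()
        terms.add((sub, vec))
        regs[dst] = frozenset(terms)
    return regs

regs = symbolic_run(SCHEDULE_THIRD)
for i in "12":          # row range of C
    for k in "12":      # column vector of C
        want = frozenset((f"A{i}{j}", f"vb{j}{k}") for j in "12")
        assert regs[f"vc{i}{k}"] == want
```

Each assertion confirms that vc with row range i and column k is the sum of the products of the submatrices in row range i with the matching partial vectors of column k.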

In this example, the arithmetic control unit 350 inserts another matrix-vector product operation that uses the first partial vector, namely the matrix-vector product of the third submatrix and the first partial vector, between the first operation, which is the matrix-vector product of the first submatrix and the first partial vector, and the second operation, which uses the result of the first operation. As a result, the calculation control unit 350 can make use of the one free cycle required between the first operation and the second operation.

When a plurality of free cycles occur between the first operation and the second operation, the operation control unit 350 may insert the matrix-vector products of each of a plurality of third submatrices and the first partial vector between the first operation and the second operation. For example, the matrix storage unit 320 stores a plurality of third submatrices A21, A31, ... that are included in the first matrix A and are to be multiplied by the first partial vector. The arithmetic control unit 350 fills the plurality of cycles, from the start of the pipeline operation of the matrix-vector product of the first submatrix A11 and the first partial vector vb11 until just before the cycle in which its result becomes available without delay, with the matrix-vector products of each of the third submatrices A21, A31, ... and the first partial vector vb11. The first submatrix and the plurality of third submatrices may be arranged in the same row range of the first matrix in column order or in the reverse of column order, or they may be submatrices of arbitrary column ranges that are not arranged in column order.

FIG. 7 shows a fourth example of pipeline processing by the arithmetic unit 300 according to the present embodiment. When the pipeline calculation unit 330 has more intermediate registers, or when a calculation result becomes available after being stored once in the main memory 360, the calculation control unit 350 may perform control so that the operations of cycles 4 to 5 in FIG. 6 are performed before the operations of cycles 2 to 3. In this case, while the pipeline operation of the first operation, which calculates the matrix-vector product of the first submatrix A11 and the first partial vector vb11, is being executed, the operation control unit 350 causes the pipeline calculation unit 330 to execute the operation of cycle 1 in FIG. 7, which is another matrix-vector product operation using the first partial vector, and the operation of cycle 2 in FIG. 7, which is another matrix-vector product operation using the first submatrix. Further, the calculation control unit 350 may cause the operation of cycle 3, which is the matrix-vector product of the third submatrix A21 used in the operation of cycle 1 and the second partial vector vb12 used in the operation of cycle 2, to be executed between the first operation and the second operation. As a result, the calculation control unit 350 can fill more of the free cycles between the first operation and the second operation. The operations of cycles 0 to 3 may be executed in any order, and the execution order of the operations of cycles 4 to 7 may be determined according to the execution order of the corresponding operations of cycles 0 to 3. The pipeline processing of FIG. 7 is substantially the same as the pipeline processing of FIG. 5 with the operations of cycles 1 and 2 exchanged and the operations of cycles 5 and 6 exchanged.

In any pipeline processing, including the first to fourth examples shown above, the calculation control unit 350 may instruct the memory control unit 370 to transfer the partial vectors and submatrices used by the pipeline calculation unit 330 from the main memory 360 to the vector storage unit 310 and the matrix storage unit 320 before the pipeline calculation unit 330 requires them. For example, in the example of FIG. 4, the main memory 360 may transfer the partial vectors vb11, vb12, vb21, and vb22 to the vector storage unit 310 and the submatrices A11 and A12 to the matrix storage unit 320 before cycle 0. Alternatively, the main memory 360 may transfer the partial vectors vb11 and vb12 to the vector storage unit 310 and the submatrix A11 to the matrix storage unit 320 before cycle 0, and then transfer the partial vectors vb21 and vb22 to the vector storage unit 310 and the submatrix A12 to the matrix storage unit 320 before cycle 2.
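The two transfer plans described for FIG. 4 can be derived from the schedule by noting the first cycle in which each operand is read. The Python sketch below is a simplified illustration; the function names and the idea of fixed transfer points (boundaries) are assumptions introduced here, and transfer time itself is not modeled.

```python
# Sketch: decide when each operand must be resident in the storage units.
# FIG4_READS lists the (submatrix, partial vector) read in cycles 0 to 3.

FIG4_READS = [("A11", "vb11"), ("A11", "vb12"), ("A12", "vb21"), ("A12", "vb22")]

def first_use(reads):
    """Map each operand to the first cycle in which it is read."""
    seen = {}
    for cycle, operands in enumerate(reads):
        for name in operands:
            seen.setdefault(name, cycle)
    return seen

def plan_transfers(reads, boundaries):
    """Assign each operand to the latest transfer point not after its first use."""
    plan = {b: [] for b in boundaries}
    for name, cycle in first_use(reads).items():
        slot = max(b for b in boundaries if b <= cycle)
        plan[slot].append(name)
    return plan

print(plan_transfers(FIG4_READS, [0]))      # everything before cycle 0
print(plan_transfers(FIG4_READS, [0, 2]))   # split before cycle 0 and cycle 2
```

With the single transfer point 0, all six operands are loaded before cycle 0; with transfer points 0 and 2, A11, vb11, and vb12 are loaded before cycle 0 and A12, vb21, and vb22 before cycle 2, matching the two alternatives in the text.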

In the case of the pipeline processing shown in the first example and the second example, the pipeline calculation unit 330 uses a different partial vector (vb11, vb12, vb21, or vb22) in each cycle, but changes the submatrix (A11, A12, A21, or A22) only once every two cycles. Therefore, the matrix storage unit 320 only needs a throughput sufficient to output a submatrix once every two cycles, and the power consumption and the circuit scale of the matrix storage unit 320 can be reduced.

In the case of the pipeline processing shown in the third example and the fourth example, the pipeline calculation unit 330 uses a different submatrix (A11, A12, A21, or A22) in each cycle, but changes the partial vector (vb11, vb12, vb21, or vb22) only once every two cycles. Therefore, the vector storage unit 310 only needs a throughput sufficient to output a partial vector once every two cycles, and the power consumption and the circuit scale of the vector storage unit 310 can be reduced.
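The throughput argument can be checked by counting how often each operand changes from one cycle to the next. The Python sketch below assumes, as a simplification not stated in the text, that a storage unit needs to output a new operand only when it differs from the operand of the previous cycle; the function name fetches is hypothetical, and the operand sequences are taken from the first and third examples.

```python
# Sketch: count operand changes per storage unit for two operation orders.

def fetches(sequence):
    """Number of cycles whose operand differs from that of the previous cycle."""
    return sum(1 for prev, cur in zip([None] + sequence, sequence) if cur != prev)

# First example: the submatrix is reused in consecutive cycles.
first_matrices = ["A11", "A11", "A12", "A12", "A21", "A21", "A22", "A22"]
first_vectors = ["vb11", "vb12", "vb21", "vb22", "vb11", "vb12", "vb21", "vb22"]

# Third example: the partial vector is reused in consecutive cycles.
third_matrices = ["A11", "A21", "A12", "A22", "A11", "A21", "A12", "A22"]
third_vectors = ["vb11", "vb11", "vb21", "vb21", "vb12", "vb12", "vb22", "vb22"]

print("first example:", fetches(first_matrices), fetches(first_vectors))
print("third example:", fetches(third_matrices), fetches(third_vectors))
```

Over eight cycles the first example needs four submatrix outputs and eight partial vector outputs, while the third example needs eight submatrix outputs and four partial vector outputs, which is the trade-off between the two storage units described above.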

The designer of the arithmetic unit 300, or a user of the arithmetic unit 300, may select which pipeline processing to execute so that the circuit scale or the power consumption of the arithmetic unit 300 becomes smaller.

Various embodiments of the present invention may be described with reference to flowcharts and block diagrams, in which a block may represent (1) a stage of a process in which an operation is performed or (2) a section of a device having a role of performing the operation. Specific stages and sections may be implemented by a dedicated circuit, a programmable circuit supplied with computer-readable instructions stored on a computer-readable medium, or a processor supplied with computer-readable instructions stored on a computer-readable medium. Dedicated circuits may include digital or analog hardware circuits, and may include integrated circuits (ICs) or discrete circuits. Programmable circuits may include reconfigurable hardware circuits that include logical AND, logical OR, logical XOR, logical NAND, logical NOR, and other logical operations, as well as memory elements such as flip-flops, registers, field programmable gate arrays (FPGA), and programmable logic arrays (PLA).

The computer-readable medium may include any tangible device capable of storing instructions to be executed by a suitable device, so that the computer-readable medium having the instructions stored therein constitutes a product that contains instructions that can be executed to create means for performing the operations specified in the flowcharts or block diagrams. Examples of computer-readable media may include electronic storage media, magnetic storage media, optical storage media, electromagnetic storage media, semiconductor storage media, and the like. More specific examples of computer-readable media may include floppy (registered trademark) disks, diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), electrically erasable programmable read-only memory (EEPROM), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), Blu-ray (RTM) discs, memory sticks, integrated circuit cards, and the like.

Computer-readable instructions may include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, JAVA®, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.

Computer-readable instructions may be provided to a processor or programmable circuit of a general purpose computer, special purpose computer, or other programmable data processing device, either locally or via a local area network (LAN) or a wide area network (WAN) such as the Internet, and may be executed to create means for performing the operations specified in the flowcharts or block diagrams. Examples of processors include computer processors, processing units, microprocessors, digital signal processors, controllers, microcontrollers, and the like.

FIG. 8 shows an example of a computer 2200 in which a plurality of aspects of the present invention may be embodied in whole or in part. A program installed on the computer 2200 can cause the computer 2200 to function as, or to perform operations associated with, a device according to an embodiment of the present invention or one or more sections of that device, and can cause the computer 2200 to execute a process according to an embodiment of the present invention or a stage of that process. Such a program may be executed by the CPU 2212 to cause the computer 2200 to perform specific operations associated with some or all of the blocks in the flowcharts and block diagrams described herein.

The computer 2200 according to this embodiment includes a CPU 2212, a RAM 2214, a graphic controller 2216, and a display device 2218, which are connected to each other by a host controller 2210. The computer 2200 also includes input/output units such as a communication interface 2222, a hard disk drive 2224, a DVD-ROM drive 2226, and an IC card drive, which are connected to the host controller 2210 via an input/output controller 2220. The computer also includes legacy input/output units such as a ROM 2230 and a keyboard 2242, which are connected to the input/output controller 2220 via an input/output chip 2240.

The CPU 2212 operates according to the programs stored in the ROM 2230 and the RAM 2214, thereby controlling each unit. The graphic controller 2216 acquires the image data generated by the CPU 2212 in a frame buffer or the like provided in the RAM 2214 or itself so that the image data is displayed on the display device 2218.

The communication interface 2222 communicates with other electronic devices via the network. The hard disk drive 2224 stores programs and data used by the CPU 2212 in the computer 2200. The DVD-ROM drive 2226 reads the program or data from the DVD-ROM 2201 and provides the program or data to the hard disk drive 2224 via the RAM 2214. The IC card drive reads the program and data from the IC card and writes the program and data to the IC card.

The ROM 2230 stores a boot program or the like executed by the computer 2200 at the time of activation, and/or a program that depends on the hardware of the computer 2200. The input/output chip 2240 may also connect various input/output units to the input/output controller 2220 via a parallel port, a serial port, a keyboard port, a mouse port, and the like.

The program is provided by a computer-readable medium such as the DVD-ROM 2201 or an IC card. The program is read from the computer-readable medium, installed on the hard disk drive 2224, the RAM 2214, or the ROM 2230, which are also examples of computer-readable media, and executed by the CPU 2212. The information processing described in these programs is read by the computer 2200 and brings about cooperation between the programs and the various types of hardware resources described above. A device or method may be configured by realizing the manipulation or processing of information in accordance with the use of the computer 2200.

For example, when communication is executed between the computer 2200 and an external device, the CPU 2212 may execute a communication program loaded in the RAM 2214 and instruct the communication interface 2222 to perform communication processing based on the processing described in the communication program. Under the control of the CPU 2212, the communication interface 2222 reads transmission data stored in a transmission buffer processing area provided in a recording medium such as the RAM 2214, the hard disk drive 2224, the DVD-ROM 2201, or an IC card, and transmits the read transmission data to the network, or writes reception data received from the network to a reception buffer processing area or the like provided on the recording medium.

Further, the CPU 2212 may cause all or a necessary part of a file or database stored in an external recording medium such as the hard disk drive 2224, the DVD-ROM drive 2226 (DVD-ROM 2201), or an IC card to be read into the RAM 2214, and may perform various types of processing on the data in the RAM 2214. The CPU 2212 then writes the processed data back to the external recording medium.

Various types of information, such as various types of programs, data, tables, and databases, may be stored in a recording medium and subjected to information processing. On data read from the RAM 2214, the CPU 2212 may perform various types of processing described throughout the present disclosure and specified by the instruction sequence of the program, including various types of operations, information processing, conditional judgment, conditional branching, unconditional branching, and information retrieval and replacement, and writes the results back to the RAM 2214. Further, the CPU 2212 may search for information in a file, a database, or the like in the recording medium. For example, when a plurality of entries, each having an attribute value of a first attribute associated with an attribute value of a second attribute, are stored in the recording medium, the CPU 2212 may search the plurality of entries for an entry whose attribute value of the first attribute matches a specified condition, read the attribute value of the second attribute stored in that entry, and thereby acquire the attribute value of the second attribute associated with the first attribute that satisfies the predetermined condition.

The program or software module described above may be stored on the computer 2200 or on a computer-readable medium near the computer 2200. In addition, a recording medium such as a hard disk or RAM provided in a server system connected to a dedicated communication network or the Internet can be used as a computer-readable medium, thereby providing the program to the computer 2200 via the network.

Although the present invention has been described above using the embodiments, the technical scope of the present invention is not limited to the scope described in the above embodiments. It will be apparent to those skilled in the art that various changes or improvements can be made to the above embodiments. It is clear from the claims that the form with such modifications or improvements may also be included in the technical scope of the invention.

It should be noted that the order of execution of each process, such as operations, procedures, steps, and stages in the devices, systems, programs, and methods shown in the claims, the specification, and the drawings, can be realized in any order unless the order is explicitly indicated by "before", "prior to", or the like, and unless the output of a previous process is used in a subsequent process. Even if the operation flow in the claims, the specification, and the drawings is described using "first", "next", and the like for convenience, this does not mean that it is essential to carry out the flow in this order.

300 Arithmetic logic unit
310 Vector storage unit
320 Matrix storage unit
330 Pipeline calculation unit
340 Result storage unit
350 Arithmetic control unit
360 Main memory
370 Memory control unit
2200 Computer
2201 DVD-ROM
2210 Host controller
2212 CPU
2214 RAM
2216 Graphic controller
2218 Display device
2220 I/O controller
2222 Communication interface
2224 Hard disk drive
2226 DVD-ROM drive
2230 ROM
2240 I/O chip
2242 Keyboard

Claims

1. An arithmetic unit comprising:
a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector;
a matrix storage unit that stores at least a first submatrix to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix to be multiplied by the first vector;
a pipeline calculation unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit; and
an arithmetic control unit that, while the pipeline calculation unit is executing the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, instructs the pipeline calculation unit to execute an operation of another matrix-vector product using the first partial vector or the first submatrix.

2. The arithmetic unit according to claim 1, wherein
the vector storage unit further stores a second partial vector among the first plurality of partial vectors,
the matrix storage unit further stores a second submatrix to be multiplied by the second partial vector among the first plurality of submatrices, and
after the cycle in which the calculation result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, the arithmetic control unit instructs the pipeline calculation unit to execute an operation of adding the calculation result of the matrix-vector product of the first submatrix and the first partial vector to the matrix-vector product of the second submatrix and the second partial vector.

3. The arithmetic unit according to claim 1 or 2, wherein
the vector storage unit further stores a third partial vector to be multiplied by the first submatrix, among a second plurality of partial vectors obtained by dividing a second vector to be multiplied by the first matrix, and
during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the arithmetic control unit instructs the pipeline calculation unit to execute, as the operation of the other matrix-vector product, an operation of the matrix-vector product of the first submatrix and the third partial vector.

4. The arithmetic unit according to claim 3, wherein the first vector and the second vector are column vectors included in a second matrix to be multiplied by the first matrix.

5. The arithmetic unit according to claim 4, wherein
the vector storage unit stores a plurality of the second vectors included in the second matrix, and
the arithmetic control unit fills each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until before the calculation result becomes available without delay, with the matrix-vector product of the first submatrix and the third partial vector from each of the plurality of second vectors.

6. The arithmetic unit according to claim 1 or 2, wherein
the matrix storage unit further stores a third submatrix to be multiplied by the first partial vector among the first plurality of submatrices, and
during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the arithmetic control unit instructs the pipeline calculation unit to execute, as the operation of the other matrix-vector product, an operation of the matrix-vector product of the third submatrix and the first partial vector.

7. The arithmetic unit according to claim 6, wherein
the matrix storage unit stores a plurality of the third submatrices, and
the arithmetic control unit fills each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until before the calculation result becomes available without delay, with the matrix-vector product of each of the plurality of third submatrices and the first partial vector.

8. A calculation method, wherein
a vector storage unit stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector,
a matrix storage unit stores at least a first submatrix to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix to be multiplied by the first vector, and
a pipeline calculation unit, which is capable of executing, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, starts execution of an operation of another matrix-vector product using the first partial vector or the first submatrix during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector.

9. An arithmetic program executed by an arithmetic unit, wherein
the arithmetic unit comprises:
a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector;
a matrix storage unit that stores at least a first submatrix to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix to be multiplied by the first vector; and
a pipeline calculation unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, and
the arithmetic program causes the arithmetic unit to start, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, execution of an operation of another matrix-vector product using the first partial vector or the first submatrix.