WO2020246598A1 - Calculation device, calculation method, and calculation program - Google Patents

Calculation device, calculation method, and calculation program

Info

Publication number
WO2020246598A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
matrix
partial
storage unit
calculation
Application number
PCT/JP2020/022377
Other languages
French (fr)
Japanese (ja)
Inventor
淳一郎 牧野
戎崎 俊一
Original Assignee
国立研究開発法人理化学研究所
Application filed by 国立研究開発法人理化学研究所
Publication of WO2020246598A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • the present invention relates to an arithmetic unit, an arithmetic method, and an arithmetic program.
  • matrix-matrix products (hereinafter referred to as "matrix products") and matrix-vector products account for most of the calculation amount. Therefore, arithmetic units and arithmetic methods for efficiently executing such matrix operations have been developed (see Patent Documents 1 to 3). Processors capable of performing matrix operations have also been developed.
  • Patent Document 1: International Publication No. 2018/207926
  • Patent Document 2: Japanese Unexamined Patent Publication No. 2018-139045
  • Patent Document 3: Japanese Unexamined Patent Publication No. 2018-197906
  • the matrix-vector product of an n-dimensional square matrix and an n-dimensional vector includes n^2 multiplications and about n^2 additions, for a calculation amount of about 2n^2. Therefore, when the n-dimensional square matrix is fixed, the amount of calculation of the matrix-vector product is on the order of n^2 for the input of one n-dimensional vector, that is, n elements. Consequently, if the matrix size is increased and the matrix arithmetic unit is made larger, the ratio of the data load amount to the calculation amount can be reduced. However, when the matrix arithmetic unit is made large, the load/store capacity of a register file or the like becomes relatively low, and the processing performance for operations on small matrices and for operations other than matrix operations becomes relatively low.
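  • as a rough illustration of the operation count stated above, the following Python sketch counts the scalar multiplications and additions of an n x n matrix-vector product and the number of vector elements loaded per scalar operation when the matrix stays resident. The function names are hypothetical and are not taken from the patent; the count of n(n-1) additions is one common way of arriving at "about n^2".

```python
def matvec_op_counts(n):
    """Return (multiplications, additions) for an n x n matrix times an n-vector."""
    multiplications = n * n        # one product per matrix element
    additions = n * (n - 1)        # n - 1 additions per output row, i.e. about n^2
    return multiplications, additions


def loads_per_operation(n):
    """Vector elements loaded per scalar operation while the matrix stays resident."""
    multiplications, additions = matvec_op_counts(n)
    return n / (multiplications + additions)


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, matvec_op_counts(n), round(loads_per_operation(n), 4))
```

The ratio falls as n grows, which is the motivation given for a larger matrix unit; the sketch does not model the register-file cost that the same paragraph warns about.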
  • an arithmetic unit may include a vector storage unit that stores at least the first partial vector among the first plurality of partial vectors obtained by dividing the first vector.
  • the arithmetic unit may include a matrix storage unit that stores at least the first submatrix to be multiplied by the first partial vector among the first plurality of submatrices obtained by dividing, in the row direction and the column direction, the first matrix to be multiplied by the first vector.
  • the arithmetic unit may include a pipeline arithmetic unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit.
  • the arithmetic unit may include an arithmetic control unit that, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, instructs the pipeline arithmetic unit to execute an operation of another matrix-vector product using the first partial vector or the first submatrix.
  • the vector storage unit may further store the second partial vector among the first plurality of partial vectors.
  • the matrix storage unit may further store the second submatrix to be multiplied by the second partial vector among the first plurality of submatrices.
  • the arithmetic control unit may instruct the pipeline arithmetic unit to execute, in or after the cycle in which the operation result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, an operation of adding the matrix-vector product of the second submatrix and the second partial vector to the operation result of the matrix-vector product of the first submatrix and the first partial vector.
  • the vector storage unit may further store the third partial vector to be multiplied by the first submatrix among the second plurality of partial vectors obtained by dividing the second vector to be multiplied by the first matrix.
  • the arithmetic control unit may instruct the pipeline arithmetic unit to execute the matrix-vector product of the first submatrix and the third partial vector as the other matrix-vector product operation.
  • the first vector and the second vector may be column vectors included in the second matrix to be multiplied by the first matrix.
  • the vector storage unit may store a plurality of second vectors included in the second matrix.
  • the arithmetic control unit may fill each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until just before its operation result becomes available without delay, with a matrix-vector product of the first submatrix and the third partial vector taken from each of the plurality of second vectors.
  • the matrix storage unit may further store the third submatrix to be multiplied by the first partial vector among the first plurality of submatrices.
  • the arithmetic control unit may instruct the pipeline arithmetic unit to execute the matrix-vector product of the third submatrix and the first partial vector as the other matrix-vector product operation.
  • the matrix storage unit may store a plurality of third submatrices.
  • the arithmetic control unit may fill each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until just before its operation result becomes available without delay, with a matrix-vector product of each of the plurality of third submatrices and the first partial vector.
  • an arithmetic method may include storing, in the vector storage unit, at least the first partial vector among the first plurality of partial vectors obtained by dividing the first vector.
  • the arithmetic method may include storing, in the matrix storage unit, at least the first submatrix to be multiplied by the first partial vector among the first plurality of submatrices obtained by dividing, in the row direction and the column direction, the first matrix to be multiplied by the first vector.
  • the arithmetic method may include causing a pipeline arithmetic unit, which can execute by pipeline operation an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, to start executing, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, an operation of another matrix-vector product using the first partial vector or the first submatrix.
  • an arithmetic program executed by an arithmetic unit is provided. The arithmetic unit may include a vector storage unit that stores at least the first partial vector among the first plurality of partial vectors obtained by dividing the first vector.
  • the arithmetic unit may include a matrix storage unit that stores at least the first submatrix to be multiplied by the first partial vector among the first plurality of submatrices obtained by dividing, in the row direction and the column direction, the first matrix to be multiplied by the first vector.
  • the arithmetic unit may include a pipeline arithmetic unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit.
  • the arithmetic program may cause the arithmetic unit to start executing, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, an operation of another matrix-vector product using the first partial vector or the first submatrix.
  • FIG. 8 shows an example of a computer 2200 in which a plurality of aspects of the present invention may be embodied in whole or in part.
  • FIG. 1 shows an example of a matrix operation according to this embodiment.
  • the matrices A, B, and C are square matrices of 8 rows and 8 columns.
  • in the range j ≥ 3, the illustration of each component of bij and cij is omitted.
  • the column vector of the j-th column of the matrix C is the vector vcj = (c1j, c2j, ..., c8j)^T.
  • the matrix A is shown divided into the submatrices A11, A21, A12, and A22, which are obtained by dividing the matrix A into two in each of the row direction and the column direction.
  • the partial vectors obtained by dividing the vector vcj into two in the row direction are referred to as vc1j and vc2j.
  • the matrix product is shown as an example of the matrix operation, and the matrix vector product is described as being included in a part of the matrix product.
  • the case is illustrated in which the matrices A, B, and C have a power-of-2 number of elements in the row and column directions, and the matrix A is divided into a power-of-2 number of parts in the row and column directions.
  • the matrices A, B, and C may have a number of elements other than a power of 2 in at least one of the row and column directions, and the matrix A may be divided, in at least one of the row and column directions, into a number of parts other than a power of 2 (e.g., 3x3, 5x5, 9x9, 3x5, 5x9, etc.).
  • FIG. 2 shows an example of a calculation formula obtained by decomposing the matrix operation according to the present embodiment into a matrix-vector product of a submatrix and a partial vector.
  • the matrix-vector product of the matrix A and the vector vb includes d × d matrix-vector products of a submatrix and a partial vector.
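  • the decomposition into d × d submatrix products can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (matvec, split_matrix, split_vector, blocked_matvec) are hypothetical, d is taken to be the number of divisions in each direction, and the matrix size is assumed to be divisible by d.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def split_matrix(matrix, d):
    """Divide a square matrix into d x d submatrices, indexed [row block][column block]."""
    size = len(matrix) // d
    return [
        [[row[c * size:(c + 1) * size] for row in matrix[r * size:(r + 1) * size]]
         for c in range(d)]
        for r in range(d)
    ]


def split_vector(vector, d):
    """Divide a vector into d partial vectors in the row direction."""
    size = len(vector) // d
    return [vector[k * size:(k + 1) * size] for k in range(d)]


def blocked_matvec(matrix, vector, d):
    """Compute matrix * vector as d * d submatrix-by-partial-vector products."""
    blocks = split_matrix(matrix, d)
    parts = split_vector(vector, d)
    result = []
    for r in range(d):                      # one partial result vector per row block
        acc = [0] * len(parts[0])
        for c in range(d):                  # accumulate over the column blocks
            product = matvec(blocks[r][c], parts[c])
            acc = [a + p for a, p in zip(acc, product)]
        result.extend(acc)
    return result
```

With an 8 x 8 matrix and d = 2 this performs the four products A11·vb1, A12·vb2, A21·vb1, and A22·vb2 on 4 x 4 blocks.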
  • FIG. 3 shows the configuration of the arithmetic unit 300 according to the present embodiment.
  • the arithmetic unit 300 can execute the matrix vector product of the matrix up to the number of rows and columns specified in the specifications and the vector up to the number of rows specified in the specifications as a unit operation by pipeline operation.
  • the arithmetic unit 300 calculates a matrix-vector product of a matrix and a vector that are larger than the size that can be processed by one unit operation by dividing it into a plurality of matrix-vector products of a submatrix and a partial vector, each of which can be processed by one unit operation.
  • the arithmetic unit 300 includes a vector storage unit 310, a matrix storage unit 320, a pipeline calculation unit 330, a result storage unit 340, an arithmetic control unit 350, a main memory 360, and a memory control unit 370.
  • the vector storage unit 310 stores at least the first partial vector among the first plurality of partial vectors obtained by dividing the first vector.
  • the vector storage unit 310 is a register as an example.
  • the vector storage unit 310 may be another storage device such as a cache memory that can supply the partial vector to the pipeline calculation unit 330 in a pipeline manner.
  • the first vector is the target vector to be multiplied by the first matrix, at least one submatrix of which is stored in the matrix storage unit 320.
  • the first vector is larger than the size that the arithmetic unit 300 can process in one unit of arithmetic.
  • the first plurality of partial vectors are obtained by dividing the first vector into a size that can be processed by one unit of operation.
  • the first vector corresponds to any of the vectors vbj (for example, vb1).
  • the vector storage unit 310 may have a sufficient storage area for further storing the second partial vector and the other partial vectors among the first plurality of partial vectors. For example, in the matrix operation of FIG. 1, the vector storage unit 310 may store the partial vector vb1j and the partial vector vb2j.
  • the matrix storage unit 320 stores at least the first submatrix to be multiplied by the first partial vector among the first plurality of submatrices obtained by dividing, in the row direction and the column direction, the first matrix to be multiplied by the first vector.
  • the matrix storage unit 320 is a register as an example.
  • the matrix storage unit 320 may be another storage device such as a cache memory that can supply a submatrix to the pipeline calculation unit 330 in a pipeline manner.
  • the first matrix is the target matrix to be multiplied by the first vector, at least one partial vector of which is stored in the vector storage unit 310.
  • the first matrix is larger than the size that the arithmetic unit 300 can process in one unit of arithmetic.
  • the first plurality of sub-matrixes are obtained by dividing the first matrix into a size that can be processed by the arithmetic unit 300 by one unit of arithmetic operations.
  • the first matrix corresponds to the matrix A.
  • the matrix storage unit 320 may have a sufficient storage area for further storing the second submatrix to be multiplied by the second partial vector, and other submatrices, among the first plurality of submatrices.
  • the matrix storage unit 320 may store the submatrix A11 to be multiplied by the partial vector vb1j and the submatrix A12 to be multiplied by the partial vector vb2j.
  • the first partial vector and the second partial vector are located in different row ranges in the first vector. Therefore, the first submatrix and the second submatrix are located in different column ranges in the target matrix.
  • the first sub-matrix and the second sub-matrix may be located in the same row range in the target matrix.
  • the pipeline calculation unit 330 is connected to the vector storage unit 310 and the matrix storage unit 320. It receives from the vector storage unit 310 the partial vector to be operated on that is stored there, and receives from the matrix storage unit 320 the submatrix to be operated on that is stored there.
  • the pipeline calculation unit 330 can execute, by pipeline operation, an operation of adding an intermediate vector to the matrix-vector product of the submatrix and the partial vector to be operated on.
  • the pipeline calculation unit 330 can execute, as a unit operation, the operation of calculating the matrix-vector product of a submatrix of 4 rows and 4 columns and a partial vector of 4 rows and adding an intermediate vector of 4 rows to that product to obtain the partial vector of the operation result (also referred to as the “result vector”).
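  • functionally, the unit operation described above is a fused "multiply then add intermediate vector" on 4 x 4 operands. A minimal Python sketch, assuming the 4 x 4 size of this example and using hypothetical names (UNIT_SIZE, unit_operation) that do not appear in the patent:

```python
UNIT_SIZE = 4  # rows and columns handled by one unit operation in the 4 x 4 example


def unit_operation(submatrix, partial_vector, intermediate):
    """Return submatrix * partial_vector + intermediate as the result vector."""
    if len(submatrix) != UNIT_SIZE or any(len(row) != UNIT_SIZE for row in submatrix):
        raise ValueError("submatrix must be 4 rows by 4 columns")
    if len(partial_vector) != UNIT_SIZE or len(intermediate) != UNIT_SIZE:
        raise ValueError("partial vector and intermediate vector must have 4 rows")
    return [
        sum(a * x for a, x in zip(row, partial_vector)) + t   # dot product plus carry-in
        for row, t in zip(submatrix, intermediate)
    ]
```

Passing a zero intermediate vector yields a plain matrix-vector product; passing an earlier result accumulates over the column blocks of the matrix.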
  • being executable as a unit operation means that, in response to a request such as an external instruction or the execution of an instruction, the pipeline calculation unit 330 executes the matrix-vector product of the submatrix and the partial vector to be operated on together with the addition of the intermediate vector, and outputs the result.
  • the pipeline calculation unit 330 may have a large number of calculation units so that all the basic operations (for example, multiplication and addition of values) included in this operation can be performed by separate calculation units. Alternatively, two or more basic operations within the unit operation may be performed by the same calculation unit.
  • performing a pipeline operation means that, when the pipeline calculation unit 330 outputs a result after processing in a plurality of stages following the start of an operation, the stages can operate in parallel. That is, the pipeline calculation unit 330 can sequentially start other operations in each cycle from the start of a certain operation to the output of its result, provided there is no particular obstacle to execution.
  • for example, the pipeline calculation unit 330 may input a submatrix and a partial vector in the first cycle, multiply the corresponding elements of the submatrix and the partial vector in the second cycle, sum, in the third cycle, the products calculated in the second cycle for each element of the result vector, and output the partial vector of the operation result in the fourth cycle.
  • the pipeline calculation unit 330 can have a pipeline structure having an arbitrary number of stages, if necessary.
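  • the overlap of stages can be made concrete with a small occupancy table. The sketch below assumes the four-stage example just described (input, multiply, sum, output) and one operation issued per cycle; the stage names and the function stage_table are illustrative assumptions, and a real unit may use any number of stages.

```python
STAGES = ["input", "multiply", "sum", "output"]  # the four-cycle example pipeline


def stage_table(num_ops):
    """Map each cycle (numbered from 1) to the operation occupying each stage.

    Operation k is issued in cycle k + 1 and advances one stage per cycle.
    """
    table = {}
    for op in range(num_ops):
        for stage_index, stage in enumerate(STAGES):
            cycle = op + stage_index + 1
            table.setdefault(cycle, {})[stage] = op
    return table


if __name__ == "__main__":
    for cycle, occupancy in sorted(stage_table(3).items()):
        print(cycle, occupancy)
```

In this model, num_ops independent operations finish in num_ops + len(STAGES) - 1 cycles rather than num_ops * len(STAGES), which is the benefit that is lost whenever a dependent operation has to wait.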
  • the result storage unit 340 is connected to the pipeline calculation unit 330.
  • the result storage unit 340 receives and stores the result vector output by the pipeline calculation unit 330.
  • the result vector is, for example, vc11 and vc21 in FIG.
  • the result storage unit 340 is a register as an example.
  • the result storage unit 340 may be another storage device such as a cache memory that can store the partial vector from the pipeline calculation unit 330 in a pipeline manner.
  • the vector storage unit 310, the matrix storage unit 320, and the result storage unit 340 may be implemented as the same storage device.
  • the calculation control unit 350 is connected to the vector storage unit 310, the matrix storage unit 320, the pipeline calculation unit 330, and the result storage unit 340.
  • in response to a matrix operation execution request, such as receiving an instruction from outside the arithmetic unit 300 or decoding a matrix operation instruction during program execution in the arithmetic unit 300, the arithmetic control unit 350 controls the vector storage unit 310, the matrix storage unit 320, the pipeline calculation unit 330, and the result storage unit 340 so as to execute the requested matrix operation.
  • the main memory 360 stores the matrix to be calculated and the calculation result.
  • the memory control unit 370 is connected between the vector storage unit 310, the matrix storage unit 320, the result storage unit 340, and the main memory 360.
  • the memory control unit 370 transfers data between the main memory 360 and each of the vector storage unit 310, the matrix storage unit 320, and the result storage unit 340, in response to an external instruction or a memory access instruction during program execution in the arithmetic unit 300.
  • in response to a request for a vector load from the main memory 360 to the vector storage unit 310, the memory control unit 370 reads from the main memory 360 the partial vector specified by the vector load, among the partial vectors stored in the main memory 360, and stores it in the vector storage unit 310. Likewise, in response to a request for a matrix load from the main memory 360 to the matrix storage unit 320, the memory control unit 370 reads from the main memory 360 the submatrix designated by the matrix load, among the submatrices stored in the main memory 360, and stores it in the matrix storage unit 320.
  • in response to a request for a matrix or vector store from the result storage unit 340 to the main memory 360, the memory control unit 370 reads out the matrix or vector of the calculation result stored in the result storage unit 340 and stores it in the main memory 360.
  • instead of providing the main memory 360 separately from the vector storage unit 310 and the matrix storage unit 320, a relatively large memory that functions as the vector storage unit 310 and the matrix storage unit 320 may be provided.
  • in this case, partial vectors and submatrices may be supplied to the pipeline calculation unit 330 directly from that memory in a pipelined manner.
  • the waiting time for the second operation can be reduced to some extent by supplying the result of the first operation to the second operation (bypassing or forwarding) without waiting for the result to be written to a register.
  • between the first operation and the second operation, however, it is difficult to completely eliminate the vacancy generated in the pipeline of the pipeline calculation unit 330 due to the pipeline hazard.
  • while the pipeline calculation unit 330 is performing the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector (for example, A11 × vb11), the arithmetic control unit 350 instructs the pipeline calculation unit 330 to execute another matrix-vector product operation using the first partial vector or the first submatrix.
  • the "other matrix-vector product" is an operation that does not use the operation result of the matrix-vector product of the first submatrix and the first partial vector. It may be any operation that includes a matrix-vector product; for example, it may be an operation that adds, to a matrix-vector product, an operation result other than that of the matrix-vector product of the first submatrix and the first partial vector.
  • the arithmetic control unit 350 inputs to the pipeline calculation unit 330 one or more other matrix-vector products that do not depend on the operation result of the first operation, in the interval during which the pipeline calculation unit 330 would otherwise wait for the result of the first operation before starting the second operation. This can improve the utilization efficiency of the pipeline calculation unit 330.
  • the arithmetic control unit 350 may instruct the pipeline calculation unit 330 to execute, in or after the cycle in which the operation result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, an operation of adding the matrix-vector product of the second submatrix and the second partial vector to the operation result of the matrix-vector product of the first submatrix and the first partial vector.
  • in this way, the arithmetic control unit 350 can prevent a pipeline hazard from occurring in the second operation, and can input another matrix-vector product operation between the first operation and the second operation.
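  • the effect of inserting an independent operation can be estimated with a simple in-order issue model. The sketch below assumes that a dependent operation may be issued no earlier than min_distance cycles after its producer (min_distance=2 corresponds to one free cycle between them); the function name, the operation labels, and the latency value are illustrative assumptions rather than figures from the patent.

```python
def issue_cycles(ops, min_distance=2):
    """Return {name: issue cycle} for in-order issue with one issue slot per cycle.

    ops is a list of (name, dependency) pairs in program order, where dependency
    is the name of the operation whose result is consumed, or None.
    """
    issued = {}
    cycle = 0
    for name, dependency in ops:
        if dependency is not None:
            # Stall until the producer's result is available without delay.
            cycle = max(cycle, issued[dependency] + min_distance)
        issued[name] = cycle
        cycle += 1
    return issued


# Dependent operations back to back: each accumulation waits for its producer.
naive = [("A11*vb11", None), ("A12*vb21+tmp1", "A11*vb11"),
         ("A11*vb12", None), ("A12*vb22+tmp2", "A11*vb12")]

# Independent product moved between producer and consumer.
interleaved = [("A11*vb11", None), ("A11*vb12", None),
               ("A12*vb21+tmp1", "A11*vb11"), ("A12*vb22+tmp2", "A11*vb12")]

if __name__ == "__main__":
    print(issue_cycles(naive))
    print(issue_cycles(interleaved))
```

Under these assumptions the naive order issues its last operation in cycle 5, leaving two empty issue slots, while the interleaved order issues the same four operations in cycles 0 to 3.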
  • FIG. 4 shows a first example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
  • in cycle 0, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb11, which is an example of the first partial vector, instructs the matrix storage unit 320 to read A11, which is an example of the first submatrix, and instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the first submatrix A11 and the first partial vector vb11 and to store it, as an intermediate vector in the middle of calculation, in the intermediate register (temporary register) vctmp1 of the pipeline calculation unit 330.
  • in addition to vb11, which is an example of the first partial vector, and vb21, which is an example of the second partial vector, the vector storage unit 310 further stores the third partial vector (vb12 as an example), which is to be multiplied by the first submatrix A11, among the second plurality of partial vectors obtained by dividing the second vector (vb2 as an example) to be multiplied by the first matrix A.
  • the first vector and the second vector are column vectors included in the second matrix B to be multiplied by the first matrix A, for example, the first vector is vb1 and the second vector is vb2.
  • the third partial vector is vb12 to be multiplied by the first submatrix A11 of the second plurality of partial vectors vbi2 obtained by dividing the second vector vb2.
  • the first vector and the second vector may each be separate vectors to be multiplied by the matrix A.
  • in cycle 1, the arithmetic control unit 350 instructs the vector storage unit 310 to read the third partial vector vb12, instructs the matrix storage unit 320 to read the first submatrix A11, and instructs the pipeline calculation unit 330 to execute the matrix-vector product of the first submatrix and the third partial vector as another matrix-vector product operation that does not cause a pipeline hazard. In response, the pipeline calculation unit 330 executes an operation of storing the matrix-vector product of the first submatrix and the third partial vector in the intermediate register vctmp2 of the pipeline calculation unit 330 as an intermediate vector in the middle of calculation.
  • This operation corresponds to the operation of the first matrix vector product in the third row of FIG. 2, and the matrix vector products of cycles 0 and 1 are reflected in different result vectors vc11 and vc12. Therefore, since there is no dependency between these operations, the pipeline calculation unit 330 can execute these operations without causing a pipeline hazard.
  • in cycle 2, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb21, which is an example of the second partial vector, instructs the matrix storage unit 320 to read A12, which is an example of the second submatrix, instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the second submatrix A12 and the second partial vector vb21 and to add the operation result vctmp1 of cycle 0, and instructs the result storage unit 340 to store the resulting partial vector vc11.
  • the operation of cycle 2 depends on the operation of cycle 0, but by inserting the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2, the arithmetic control unit 350 can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.
  • the vector storage unit 310 may further store the fourth subvector to be multiplied by the second submatrix A12 among the second plurality of subvectors.
  • in cycle 3, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb22, which is an example of the fourth partial vector, instructs the matrix storage unit 320 to read A12, which is an example of the second submatrix, and instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the second submatrix A12 and the fourth partial vector vb22 and to add the operation result vctmp2 of cycle 1. The main memory 360 is instructed to store the obtained partial vector vc12.
  • the operation of cycle 3 depends on the operation of cycle 1, but by inserting the operation of cycle 2, which does not depend on the operation of cycle 1, between the operations of cycle 1 and cycle 3, the arithmetic control unit 350 can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.
  • the partial vectors (vc11, vc12) of the first row range (first to fourth rows) in the plurality of column vectors (vc1, vc2) of the matrix C are calculated in cycles 0 to 3, and the partial vectors (vc21, vc22) of the second row range (fifth to eighth rows) in the plurality of column vectors (vc1, vc2) of the matrix C are calculated in cycles 4 to 7.
  • the operations of cycles 4 to 7 are the same as those of cycles 0 to 3, except that the submatrices A21 and A22 are used instead of the submatrices A11 and A12 and the partial vectors vc21 and vc22 are obtained instead of the partial vectors vc11 and vc12.
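  • the eight cycles of this first example can be written as a table and replayed in software to check that they produce the first two columns of C = A × B. The sketch below is a functional model only (it ignores timing); the helper names are hypothetical, and reusing vctmp1 and vctmp2 in cycles 4 to 7 is an assumption based on the statement that those cycles are otherwise the same as cycles 0 to 3.

```python
def matvec_add(matrix, vector, acc):
    """Return matrix * vector + acc (one unit operation)."""
    return [sum(m * x for m, x in zip(row, vector)) + a for row, a in zip(matrix, acc)]


# (cycle, submatrix, partial vector, intermediate source or None, destination)
FIG4_SCHEDULE = [
    (0, "A11", "vb11", None, "vctmp1"),
    (1, "A11", "vb12", None, "vctmp2"),
    (2, "A12", "vb21", "vctmp1", "vc11"),
    (3, "A12", "vb22", "vctmp2", "vc12"),
    (4, "A21", "vb11", None, "vctmp1"),   # register reuse is an assumption
    (5, "A21", "vb12", None, "vctmp2"),
    (6, "A22", "vb21", "vctmp1", "vc21"),
    (7, "A22", "vb22", "vctmp2", "vc22"),
]


def make_operands(A, B, size=4):
    """Split 8 x 8 A into A11..A22 and the first two columns of B into vb11..vb22."""
    operands = {}
    for i in (1, 2):
        for k in (1, 2):
            rows = A[(i - 1) * size:i * size]
            operands[f"A{i}{k}"] = [row[(k - 1) * size:k * size] for row in rows]
    for j in (1, 2):
        column = [row[j - 1] for row in B]
        for k in (1, 2):
            operands[f"vb{k}{j}"] = column[(k - 1) * size:k * size]
    return operands


def run_schedule(schedule, operands, size=4):
    """Execute the schedule in cycle order and return all registers and results."""
    registers = {}
    for _cycle, mat, vec, source, destination in schedule:
        acc = registers[source] if source is not None else [0] * size
        registers[destination] = matvec_add(operands[mat], operands[vec], acc)
    return registers
```

Concatenating vc1j and vc2j from the returned dictionary gives column j of C for j = 1, 2.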
  • the arithmetic control unit 350 inserts, between the first operation (the matrix-vector product of the first submatrix and the first partial vector) and the second operation that uses its result, another matrix-vector product operation using the first submatrix, that is, in this example, the matrix-vector product of the first submatrix and the third partial vector. As a result, the arithmetic control unit 350 can make use of the one free cycle required between the first operation and the second operation.
  • the arithmetic control unit 350 may insert matrix-vector products of the first submatrix and a plurality of third partial vectors between the first operation and the second operation.
  • the vector storage unit 310 further stores a plurality of second vectors vb2, vb3, ... Included in the second matrix B.
  • the arithmetic control unit 350 may fill each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix A11 and the first partial vector vb11 until just before its operation result becomes available without delay, with a matrix-vector product of the first submatrix A11 and the third partial vector from each of the plurality of second vectors.
  • the first vector and the plurality of second vectors may be arranged in the column order of the second matrix or in the reverse of that order, or may be column vectors of arbitrary columns that are not arranged in the column order of the second matrix.
  • FIG. 5 shows a second example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
  • the arithmetic control unit 350 may perform control so that the operations of cycles 4 to 5 in FIG. are performed before the operations of cycles 2 to 3.
  • while the pipeline operation of the first operation, which calculates the matrix-vector product of the first submatrix A11 and the first partial vector vb11, is being executed, the arithmetic control unit 350 causes the pipeline calculation unit 330 to execute the operation of cycle 1 in FIG. 5, which is a matrix-vector product using the first submatrix, and the operation of cycle 2 in FIG.
  • the arithmetic control unit 350 may also execute, between the first operation and the second operation, the operation of cycle 3, which is the matrix-vector product of the second submatrix A21 used for the operation of cycle 2 and the second partial vector vb12 used for the operation of cycle 1. As a result, the arithmetic control unit 350 can further fill the empty cycles between the first operation and the second operation.
  • the execution order of the operations in cycles 0 to 3 may be arbitrary, and the execution order of the operations in cycles 4 to 7 may be determined according to the execution order of the corresponding operations in cycles 0 to 3.
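  • one way to compare the two orderings is the smallest number of cycles separating a producer from its consumer: the larger it is, the longer the pipeline latency that can be hidden. The sketch below assumes one operation issued per cycle. The second ordering is reconstructed from the description of this second example (all four independent products first), and the register names vctmp3 and vctmp4 are hypothetical.

```python
def min_dependency_distance(schedule):
    """Return the smallest issue-cycle gap between a producer and its consumer.

    schedule is a list of (destination, source) pairs in issue order, one per
    cycle, where source is the register consumed or None.
    """
    produced_at = {}
    smallest = None
    for cycle, (destination, source) in enumerate(schedule):
        if source is not None:
            gap = cycle - produced_at[source]
            smallest = gap if smallest is None else min(smallest, gap)
        produced_at[destination] = cycle
    return smallest


# First example: products and accumulations alternate in pairs.
FIRST_EXAMPLE = [("vctmp1", None), ("vctmp2", None), ("vc11", "vctmp1"), ("vc12", "vctmp2"),
                 ("vctmp1", None), ("vctmp2", None), ("vc21", "vctmp1"), ("vc22", "vctmp2")]

# Second example (reconstructed): all four independent products are issued first.
SECOND_EXAMPLE = [("vctmp1", None), ("vctmp2", None), ("vctmp3", None), ("vctmp4", None),
                  ("vc11", "vctmp1"), ("vc12", "vctmp2"), ("vc21", "vctmp3"), ("vc22", "vctmp4")]

if __name__ == "__main__":
    print(min_dependency_distance(FIRST_EXAMPLE), min_dependency_distance(SECOND_EXAMPLE))
```

Under these assumptions the second ordering doubles the separation at the cost of two additional intermediate registers.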
  • FIG. 6 shows a third example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
  • the calculation control unit 350 performs the same control as in cycle 0 of FIG.
  • in addition to A11, which is an example of the first submatrix, and A12, which is an example of the second submatrix, the matrix storage unit 320 further stores the third submatrix (A21 as an example), which is to be multiplied by the first partial vector vb11, among the first plurality of submatrices obtained by dividing the first matrix A in the row direction and the column direction.
  • the third submatrix may be a submatrix included in a matrix other than the first matrix A.
  • in cycle 1, the arithmetic control unit 350 instructs the vector storage unit 310 to read the first partial vector vb11, instructs the matrix storage unit 320 to read the third submatrix A21, and instructs the pipeline calculation unit 330 to execute the calculation of the matrix-vector product of the third submatrix A21 and the first partial vector vb11.
  • in response, the pipeline calculation unit 330 executes an operation of storing the matrix-vector product of the third submatrix A21 and the first partial vector vb11 in the intermediate register vctmp2 of the pipeline calculation unit 330 as an intermediate vector in the middle of calculation. This operation corresponds to the operation of the first matrix vector product in the second row of FIG. 2, and the matrix vector products of cycles 0 and 1 are reflected in different result vectors vc11 and vc21. Therefore, since there is no dependency between these operations, the pipeline calculation unit 330 can execute these operations without causing a pipeline hazard.
  • the calculation control unit 350 performs the same control as in cycle 2 of FIG.
  • the operation of cycle 2 depends on the operation of cycle 0, but the arithmetic control unit 350 inserts the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2. This can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.
  • the matrix storage unit 320 may further store, among the first plurality of submatrices Aij, the fourth submatrix that is to be multiplied by the second partial vector vb21.
  • the arithmetic control unit 350 instructs the vector storage unit 310 to read vb21, which is an example of the second partial vector, and instructs the matrix storage unit 320 to read A22, which is an example of the fourth submatrix.
  • the main memory 360 is instructed to store the obtained partial vector vc21.
  • the operation of cycle 3 depends on the operation of cycle 1, but the arithmetic control unit 350 inserts the operation of cycle 2, which does not depend on the operation of cycle 1, between the operations of cycle 1 and cycle 3. This can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.
  • two partial vectors vc11 and vc21 included in one column vector vc1 of the matrix C are calculated in cycles 0 to 3, and two partial vectors vc12 and vc22 included in another column vector vc2 of the matrix C are calculated in cycles 4 to 7.
  • the operations of cycles 4 to 7 are the same except that the partial vectors vb12 and vb22 are used instead of the partial vectors vb11 and vb21, and the partial vectors vc12 and vc22 are used instead of the partial vectors vc11 and vc21, and thus the description thereof will be omitted.
  • the arithmetic control unit 350 inserts, between the first operation (the matrix vector product of the first submatrix and the first partial vector) and the second operation that uses its result, another matrix vector product operation that uses the first partial vector; in this example, the matrix vector product of the third submatrix and the first partial vector. As a result, the arithmetic control unit 350 can make use of the one free cycle required between the first operation and the second operation.
  • the arithmetic control unit 350 may insert the matrix vector products of each of the plurality of third submatrices and the first partial vector between the first operation and the second operation.
  • the matrix storage unit 320 stores a plurality of third submatrices A21, A31, ..., which are included in the first matrix A and are to be multiplied by the first partial vector.
  • the arithmetic control unit 350 may fill the plurality of cycles, from the start of the pipeline operation of the matrix vector product of the first submatrix A11 and the first partial vector vb11 until before its result becomes available without delay, with the matrix vector products of the respective third submatrices and the first partial vector vb11.
  • the first submatrix and the plurality of third submatrixes may be arranged in the same row range of the first matrix in the column order or the reverse order of the column order, and are not arranged in the column order of the second matrix. Each may be a submatrix of any column range.
  • FIG. 7 shows a fourth example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
  • the arithmetic control unit 350 may perform control so that the calculations of cycles 4 to 5 in FIG. are performed before the calculations of cycles 2 to 3.
  • while the pipeline operation of the first operation, which calculates the matrix vector product of the first submatrix A11 and the first partial vector vb11, is being executed, the arithmetic control unit 350 causes the pipeline calculation unit 330 to execute the calculation of cycle 1 in FIG. 7, which is another matrix vector product using the first partial vector, and the calculation of cycle 2 in FIG. 7, which is another matrix vector product using the first submatrix.
  • the arithmetic control unit 350 may execute the calculation of cycle 3, which is the matrix vector product of the third submatrix A21 used in the calculation of cycle 1 and the second partial vector vb12 used in the calculation of cycle 2, between the first operation and the second operation. As a result, the arithmetic control unit 350 can fill a further empty cycle between the first operation and the second operation.
  • the execution order of the operations in cycles 0 to 3 may be arbitrary, and the execution order of the operations in cycles 4 to 7 may be determined according to the execution order of the corresponding operations in cycles 0 to 3.
  • the pipeline processing of FIG. 7 is substantially the same as that in which the operations of cycles 1 and 2 in the pipeline processing of FIG. 5 are exchanged and the operations of cycles 5 and 6 are exchanged.
  • the arithmetic control unit 350 may instruct the memory control unit 370 to transfer the partial vectors and submatrices used by the pipeline calculation unit 330 from the main memory 360 to the vector storage unit 310 and the matrix storage unit 320 before the pipeline calculation unit 330 requires them.
  • the main memory 360 may transfer the partial vectors vb11, vb12, vb21, and vb22 to the vector storage unit 310 and the submatrices A11 and A12 to the matrix storage unit 320 before cycle 0.
  • the main memory 360 may transfer the partial vectors vb11 and vb12 to the vector storage unit 310 and the submatrix A11 to the matrix storage unit 320 before cycle 0, and may transfer the partial vectors vb21 and vb22 to the vector storage unit 310 and the submatrix A12 to the matrix storage unit 320 before cycle 2.
  • the pipeline calculation unit 330 uses a different partial vector vb11, vb12, vb21, or vb22 in each cycle, but uses a new submatrix A11, A12, A21, or A22 only once every two cycles. Therefore, the matrix storage unit 320 only needs a throughput capable of outputting a submatrix once every two cycles, and the power consumption and the circuit scale of the matrix storage unit 320 can be reduced.
  • the pipeline calculation unit 330 uses a different submatrix A11, A12, A21, or A22 in each cycle, but uses a new partial vector vb11, vb12, vb21, or vb22 only once every two cycles. Therefore, the vector storage unit 310 only needs a throughput capable of outputting a partial vector once every two cycles, and the power consumption and the circuit scale of the vector storage unit 310 can be reduced.
  • the designer of the arithmetic unit 300, or a user of the arithmetic unit 300, may select which pipeline processing to execute so that the circuit scale or the power consumption of the arithmetic unit 300 becomes smaller.
  • Various embodiments of the present invention may be described with reference to flowcharts and block diagrams, in which a block may represent (1) a stage of a process in which an operation is performed or (2) a section of a device having a role of performing the operation. Specific stages and sections may be implemented by a dedicated circuit, a programmable circuit supplied with computer-readable instructions stored on a computer-readable medium, or a processor supplied with computer-readable instructions stored on a computer-readable medium.
  • Dedicated circuits may include either digital or analog hardware circuits, and may include either integrated circuits (ICs) or discrete circuits.
  • Programmable circuits may include reconfigurable hardware circuits, including logical AND, logical OR, logical XOR, logical NAND, logical NOR, and other logical operations, as well as memory elements such as flip-flops, registers, field programmable gate arrays (FPGA), and programmable logic arrays (PLA).
  • the computer-readable medium may include any tangible device capable of storing instructions to be executed by an appropriate device, so that the computer-readable medium having the instructions stored therein constitutes a product containing instructions that can be executed to create means for performing the operations specified in a flowchart or block diagram. Examples of computer-readable media may include electronic storage media, magnetic storage media, optical storage media, electromagnetic storage media, semiconductor storage media, and the like.
  • Computer-readable media include floppy (registered trademark) disks, diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), Electrically erasable programmable read-only memory (EEPROM), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile disc (DVD), Blu-ray (RTM) disc, memory stick, integrated A circuit card or the like may be included.
  • Computer-readable instructions may include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, JAVA®, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
  • Computer-readable instructions may be provided to a processor or programmable circuit of a general purpose computer, special purpose computer, or other programmable data processing device, either locally or via a local area network (LAN) or a wide area network (WAN) such as the Internet, and may be executed to create means for performing the operations specified in the flowchart or block diagram.
  • processors include computer processors, processing units, microprocessors, digital signal processors, controllers, microcontrollers and the like.
  • FIG. 8 shows an example of a computer 2200 in which a plurality of aspects of the present invention may be embodied in whole or in part.
  • a program installed on the computer 2200 may cause the computer 2200 to function as operations associated with the device according to an embodiment of the present invention or as one or more sections of the device, or may cause the computer 2200 to execute those operations or those one or more sections, and may cause the computer 2200 to execute a process according to an embodiment of the present invention or a stage of that process.
  • Such a program may be run by the CPU 2212 to cause the computer 2200 to perform certain operations associated with some or all of the blocks in the flowcharts and block diagrams described herein.
  • the computer 2200 includes a CPU 2212, a RAM 2214, a graphic controller 2216, and a display device 2218, which are connected to each other by a host controller 2210.
  • the computer 2200 also includes input / output units such as a communication interface 2222, a hard disk drive 2224, a DVD-ROM drive 2226, and an IC card drive, which are connected to the host controller 2210 via an input / output controller 2220.
  • the computer also includes legacy I / O units such as the ROM 2230 and keyboard 2242, which are connected to the I / O controller 2220 via an I / O chip 2240.
  • the CPU 2212 operates according to the programs stored in the ROM 2230 and the RAM 2214, thereby controlling each unit.
  • the graphic controller 2216 acquires the image data generated by the CPU 2212 in a frame buffer or the like provided in the RAM 2214 or in the graphic controller 2216 itself, and causes the image data to be displayed on the display device 2218.
  • the communication interface 2222 communicates with other electronic devices via the network.
  • the hard disk drive 2224 stores programs and data used by the CPU 2212 in the computer 2200.
  • the DVD-ROM drive 2226 reads the program or data from the DVD-ROM 2201 and provides the program or data to the hard disk drive 2224 via the RAM 2214.
  • the IC card drive reads the program and data from the IC card and writes the program and data to the IC card.
  • the ROM 2230 stores either a boot program or the like executed by the computer 2200 at the time of activation, or a program that depends on the hardware of the computer 2200.
  • the input / output chip 2240 may also connect various input / output units to the input / output controller 2220 via a parallel port, a serial port, a keyboard port, a mouse port, and the like.
  • the program is provided by a computer-readable medium such as a DVD-ROM 2201 or an IC card.
  • the program is read from a computer-readable medium, installed on a hard disk drive 2224, RAM 2214, or ROM 2230, which is also an example of a computer-readable medium, and executed by the CPU 2212.
  • the information processing described in these programs is read by the computer 2200 and brings about cooperation between the programs and the various types of hardware resources described above.
  • the device or method may be configured to perform manipulation or processing of information in accordance with the use of computer 2200.
  • the CPU 2212 may execute a communication program loaded in the RAM 2214 and instruct the communication interface 2222 to perform communication processing based on the processing described in the communication program.
  • the communication interface 2222 reads transmission data stored in a transmission buffer processing area provided in a recording medium such as the RAM 2214, the hard disk drive 2224, the DVD-ROM 2201, or an IC card, and transmits the read transmission data to the network, or writes reception data received from the network to a reception buffer processing area or the like provided on the recording medium.
  • the CPU 2212 causes the RAM 2214 to read all or necessary parts of a file or database stored in an external recording medium such as a hard disk drive 2224, a DVD-ROM drive 2226 (DVD-ROM2201), or an IC card. Various types of processing may be performed on the data on the RAM 2214. The CPU 2212 then writes back the processed data to an external recording medium.
  • the CPU 2212 may perform, on the data read from the RAM 2214, various types of processing that are described throughout the present disclosure and specified by the instruction sequence of the program, including various types of operations, information processing, conditional judgment, conditional branching, unconditional branching, and information retrieval and replacement, and writes the results back to the RAM 2214. Further, the CPU 2212 may search for information in a file, a database, or the like in the recording medium.
  • the CPU 2212 may search the plurality of entries for an entry whose attribute value of the first attribute matches a specified condition, read the attribute value of the second attribute stored in that entry, and thereby acquire the attribute value of the second attribute associated with the first attribute that satisfies the predetermined condition.
  • the program or software module described above may be stored on a computer 2200 or on a computer-readable medium near the computer 2200.
  • a recording medium such as a hard disk or RAM provided in a server system connected to a dedicated communication network or the Internet can be used as the computer-readable medium, thereby providing the program to the computer 2200 via the network.
  • 300 Arithmetic unit
  • 310 Vector storage unit
  • 320 Matrix storage unit
  • 330 Pipeline calculation unit
  • 340 Result storage unit
  • 350 Arithmetic control unit
  • 360 Main memory
  • 370 Memory control unit
  • 2200 Computer
  • 2201 DVD-ROM
  • 2210 Host controller
  • 2212 CPU
  • 2214 RAM
  • 2216 Graphic controller
  • 2218 Display device
  • 2220 I/O controller
  • 2222 Communication interface

Abstract

Provided is a calculation device comprising: a vector storage unit which stores, among a plurality of first partial vectors obtained by dividing a first vector, at least a first partial vector; a matrix storage unit which stores, among a plurality of first submatrixes obtained by dividing a first matrix to be multiplied by the first vector in the row direction and the column direction, at least a first submatrix to be multiplied by the first partial vector; a pipeline calculation unit which, through pipeline calculation, executes calculation for adding an intermediate vector to a matrix vector product of the submatrix stored in the matrix storage unit and the partial vector stored in the vector storage unit; and a calculation control unit which while the pipeline calculation unit executes the pipeline calculation of the matrix vector product of the first submatrix and the first partial vector, instructs the pipeline calculation unit to execute the calculation of another matrix vector product using the first partial vector or the first submatrix.

Description

Arithmetic unit, arithmetic method, and arithmetic program
 The present invention relates to an arithmetic unit, an arithmetic method, and an arithmetic program.
 In various applications such as numerical calculation and deep learning, the matrix-matrix product (hereinafter referred to as the "matrix product") and the matrix vector product account for most of the amount of computation. For this reason, arithmetic units and arithmetic methods for executing such matrix operations efficiently have been developed (see Patent Documents 1 to 3). Processors capable of executing matrix operations have also been developed.
[Prior art documents]
[Patent documents]
  [Patent Document 1] International Publication No. 2018/207926
  [Patent Document 2] Japanese Unexamined Patent Publication No. 2018-139045
  [Patent Document 3] Japanese Unexamined Patent Publication No. 2018-197906
Problems to be solved
 The matrix vector product of an n-dimensional square matrix and an n-dimensional vector involves n² multiplications and about n² additions, for a total of about 2n² operations. Therefore, when the n-dimensional square matrix is fixed, the amount of computation of the matrix vector product is on the order of n² for the input of one n-dimensional vector. Accordingly, increasing the matrix size and enlarging the matrix arithmetic unit reduces the ratio of the amount of data loaded to the amount of computation. However, when the matrix arithmetic unit is enlarged, the load/store capability of the register file and the like becomes relatively low, and the processing performance for operations on small matrices and for operations other than matrix operations becomes relatively low.
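 As a rough illustration of the operation counts above, the following Python sketch counts the multiplications and additions of a dense matrix vector product and the ratio of loaded vector elements to arithmetic operations when the matrix is held fixed. The function names are hypothetical, and the sketch counts exactly n(n − 1) additions, which is the "about n²" of the text.

```python
def matvec_op_counts(n):
    """Return (multiplications, additions) for an n x n matrix times an n-vector."""
    mults = n * n          # one multiplication per matrix element
    adds = n * (n - 1)     # each of the n rows sums n products with n - 1 additions
    return mults, adds


def loads_per_operation(n):
    """Vector elements loaded per arithmetic operation when the matrix stays fixed."""
    mults, adds = matvec_op_counts(n)
    return n / (mults + adds)


for n in (4, 8, 16):
    print(n, matvec_op_counts(n), round(loads_per_operation(n), 4))
```

 The ratio falls as n grows, which is the motivation the text gives for a larger matrix arithmetic unit.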
General disclosure
 In a first aspect of the present invention, an arithmetic unit is provided. The arithmetic unit may include a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. The arithmetic unit may include a matrix storage unit that stores at least a first submatrix, which is to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is multiplied. The arithmetic unit may include a pipeline calculation unit capable of executing, by pipeline operation, an operation that adds an intermediate vector to the matrix vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit. The arithmetic unit may include an arithmetic control unit that, while the pipeline calculation unit is executing the pipeline operation of the matrix vector product of the first submatrix and the first partial vector, instructs the pipeline calculation unit to execute another matrix vector product operation that uses the first partial vector or the first submatrix.
 The vector storage unit may further store a second partial vector among the first plurality of partial vectors. The matrix storage unit may further store a second submatrix, which is to be multiplied by the second partial vector, among the first plurality of submatrices. The arithmetic control unit may instruct the pipeline calculation unit to execute, in or after the cycle in which the result of the matrix vector product of the first submatrix and the first partial vector becomes available without delay, an operation that adds the matrix vector product of the second submatrix and the second partial vector to that result.
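 The timing constraint in the preceding paragraph can be illustrated with a small in-order issue model. This is a sketch under stated assumptions, not the circuit of the embodiment: the two-cycle result delay, the register names `vctmp1` and `vctmp2` as used here, and the function `stall_cycles` are hypothetical, and the operand labels simply follow the A11/vb11 notation of the document. The model counts how many cycles an in-order pipeline would wait if a dependent operation were issued directly after the operation it depends on, and shows that one independent operation placed in between removes the wait.

```python
RESULT_DELAY = 2  # assumed: a result is usable 2 cycles after its operation is issued


def stall_cycles(schedule, delay=RESULT_DELAY):
    """Count wait cycles for an in-order pipeline.

    Each entry is (matrix, vector, accumulated_input_or_None, output).
    """
    ready = {}   # result name -> first cycle in which it can be read
    cycle = 0
    stalls = 0
    for _matrix, _vector, acc, out in schedule:
        if acc is not None and ready[acc] > cycle:
            stalls += ready[acc] - cycle   # wait until the accumulated input exists
            cycle = ready[acc]
        ready[out] = cycle + delay
        cycle += 1
    return stalls


back_to_back = [
    ("A11", "vb11", None, "vctmp1"),
    ("A12", "vb21", "vctmp1", "vc11"),   # needs vctmp1 one cycle too early
]
with_filler = [
    ("A11", "vb11", None, "vctmp1"),
    ("A21", "vb11", None, "vctmp2"),     # independent operation fills the gap
    ("A12", "vb21", "vctmp1", "vc11"),
]
print(stall_cycles(back_to_back), stall_cycles(with_filler))
```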
 The vector storage unit may further store a third partial vector, which is to be multiplied by the first submatrix, among a second plurality of partial vectors obtained by dividing a second vector that is to be multiplied by the first matrix. During the pipeline operation of the matrix vector product of the first submatrix and the first partial vector, the arithmetic control unit may instruct the pipeline calculation unit to execute, as the other matrix vector product operation, the matrix vector product of the first submatrix and the third partial vector.
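 As an illustration of this ordering, the sketch below issues the two products that share the first submatrix in consecutive steps (A11 with vb11, then A11 with vb12) before the accumulating products with A12. The toy 2×2 blocks, the register dictionary, and the helper names are assumptions made for illustration only. The result of the interleaved order is the same as evaluating vc1j = A11 × vb1j + A12 × vb2j directly for j = 1 and j = 2.

```python
def matvec(a, v):
    """Dense matrix vector product on plain Python lists."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def run(schedule, mats, vecs):
    """Execute (matrix, vector, output) steps in order, accumulating into each output."""
    regs = {}
    for mat, vec, out in schedule:
        product = matvec(mats[mat], vecs[vec])
        acc = regs.get(out, [0] * len(product))   # intermediate vector, zero at first use
        regs[out] = [p + q for p, q in zip(product, acc)]
    return regs


mats = {"A11": [[1, 2], [3, 4]], "A12": [[5, 6], [7, 8]]}
vecs = {"vb11": [1, 0], "vb21": [0, 1], "vb12": [2, 1], "vb22": [1, 3]}

schedule = [
    ("A11", "vb11", "vc11"),
    ("A11", "vb12", "vc12"),   # reuses A11 while the first product is still in flight
    ("A12", "vb21", "vc11"),
    ("A12", "vb22", "vc12"),
]
result = run(schedule, mats, vecs)
print(result)
```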
 The first vector and the second vector may be column vectors included in a second matrix that is to be multiplied by the first matrix.
 The vector storage unit may store a plurality of second vectors included in the second matrix. The arithmetic control unit may fill each cycle, from after the start of the pipeline operation of the matrix vector product of the first submatrix and the first partial vector until before its result becomes available without delay, with the matrix vector product of the first submatrix and the third partial vector taken from each of the plurality of second vectors.
 The matrix storage unit may further store a third submatrix, which is to be multiplied by the first partial vector, among the first plurality of submatrices. During the pipeline operation of the matrix vector product of the first submatrix and the first partial vector, the arithmetic control unit may instruct the pipeline calculation unit to execute, as the other matrix vector product operation, the matrix vector product of the third submatrix and the first partial vector.
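 The two ways of filling the pipeline, reusing the first submatrix with another partial vector or reusing the first partial vector with another submatrix, differ in which operand repeats in consecutive cycles. The sketch below counts, for two illustrative eight-cycle orders built from the products A11, A12, A21, A22 with vb11, vb12, vb21, vb22, how many cycles need a new submatrix and how many need a new partial vector. The two orders and the function `operand_changes` are assumptions for illustration, not a transcription of the figures.

```python
def operand_changes(schedule):
    """Return (submatrix_fetches, vector_fetches) for a list of (matrix, vector) steps.

    An operand counts as fetched whenever it differs from the previous cycle's operand.
    """
    mat_fetches = vec_fetches = 0
    prev_mat = prev_vec = None
    for mat, vec in schedule:
        if mat != prev_mat:
            mat_fetches += 1
        if vec != prev_vec:
            vec_fetches += 1
        prev_mat, prev_vec = mat, vec
    return mat_fetches, vec_fetches


# Each submatrix is used with two partial vectors in consecutive cycles.
SUBMATRIX_REUSE = [("A11", "vb11"), ("A11", "vb12"), ("A12", "vb21"), ("A12", "vb22"),
                   ("A21", "vb11"), ("A21", "vb12"), ("A22", "vb21"), ("A22", "vb22")]
# Each partial vector is used with two submatrices in consecutive cycles.
VECTOR_REUSE = [("A11", "vb11"), ("A21", "vb11"), ("A12", "vb21"), ("A22", "vb21"),
                ("A11", "vb12"), ("A21", "vb12"), ("A12", "vb22"), ("A22", "vb22")]

print(operand_changes(SUBMATRIX_REUSE), operand_changes(VECTOR_REUSE))
```

 An operand that changes only every other cycle has to be output from its storage unit only once every two cycles, which is the basis of the throughput remark made elsewhere in this document.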
 The matrix storage unit may store a plurality of third submatrices. The arithmetic control unit may fill each cycle, from after the start of the pipeline operation of the matrix vector product of the first submatrix and the first partial vector until before its result becomes available without delay, with the matrix vector product of each of the plurality of third submatrices and the first partial vector.
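 How many cycles such filler operations cover can be sketched as follows. The generator below orders the products so that all row blocks of one column range (A11, A21, A31, ...) are multiplied by the same partial vector back to back, and then measures the smallest distance in cycles between two operations that accumulate into the same result vector. The function names and the one-operation-per-cycle model are assumptions for illustration.

```python
def column_first_schedule(row_blocks, col_blocks, j=1):
    """Order A[m][n] x vb[n][j] so every row block of one column range runs consecutively."""
    return [(f"A{m}{n}", f"vb{n}{j}", f"vc{m}{j}")
            for n in range(1, col_blocks + 1)
            for m in range(1, row_blocks + 1)]


def min_dependency_gap(schedule):
    """Smallest number of cycles between two operations that write the same result."""
    last_seen = {}
    gap = None
    for cycle, (_matrix, _vector, out) in enumerate(schedule):
        if out in last_seen:
            distance = cycle - last_seen[out]
            gap = distance if gap is None else min(gap, distance)
        last_seen[out] = cycle
    return gap


schedule = column_first_schedule(row_blocks=3, col_blocks=2)
print(schedule)
print(min_dependency_gap(schedule))
```

 In this model the gap equals the number of row blocks, so more third submatrices cover a longer pipeline delay.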
 In a second aspect of the present invention, an arithmetic method is provided. The arithmetic method may include a vector storage unit storing at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. The arithmetic method may include a matrix storage unit storing at least a first submatrix, which is to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is multiplied. The arithmetic method may include a pipeline calculation unit, which is capable of executing by pipeline operation an operation that adds an intermediate vector to the matrix vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, starting execution of another matrix vector product operation that uses the first partial vector or the first submatrix during the pipeline operation of the matrix vector product of the first submatrix and the first partial vector.
 In a third aspect of the present invention, an arithmetic program executed by an arithmetic unit is provided. The arithmetic unit may include a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. The arithmetic unit may include a matrix storage unit that stores at least a first submatrix, which is to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing, in the row direction and the column direction, a first matrix by which the first vector is multiplied. The arithmetic unit may include a pipeline calculation unit capable of executing, by pipeline operation, an operation that adds an intermediate vector to the matrix vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit. The arithmetic program may cause the arithmetic unit to start executing another matrix vector product operation that uses the first partial vector or the first submatrix during the pipeline operation of the matrix vector product of the first submatrix and the first partial vector.
 Note that the above summary of the invention does not enumerate all the necessary features of the present invention. Sub-combinations of these feature groups may also constitute inventions.
FIG. 1 shows an example of a matrix operation according to the present embodiment.
FIG. 2 shows an example of a calculation formula obtained by decomposing the matrix operation according to the present embodiment into matrix vector products of submatrices and partial vectors.
FIG. 3 shows the configuration of the arithmetic unit 300 according to the present embodiment.
FIG. 4 shows a first example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
FIG. 5 shows a second example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
FIG. 6 shows a third example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
FIG. 7 shows a fourth example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
FIG. 8 shows an example of a computer 2200 in which a plurality of aspects of the present invention may be embodied in whole or in part.
 Hereinafter, the present invention will be described through embodiments of the invention, but the following embodiments do not limit the invention according to the claims. In addition, not all combinations of the features described in the embodiments are essential to the solution of the invention.
 FIG. 1 shows an example of a matrix operation according to the present embodiment. The figure shows the matrix operation C = A × B, which calculates the matrix product of the matrix A and the matrix B and assigns it to the matrix C. The matrices A, B, and C are square matrices of 8 rows and 8 columns.
 aij (i = 1, 2, ..., 8; j = 1, 2, ..., 8) is a component (also referred to as an "element") of the matrix A. bij (i = 1, 2, ..., 8; j = 1, 2, ..., 8) is a component of the matrix B. cij (i = 1, 2, ..., 8; j = 1, 2, ..., 8) is a component of the matrix C. For the range j ≥ 3, the components bij and cij are not shown in the figure.
 The column vector of column j of the matrix B, that is, the vector whose components are the components bij (i = 1, 2, ..., 8) of column j of the matrix B, is denoted as the vector vbj, and the column vector of column j of the matrix C is denoted as the vector vcj. That is, vbj = (b1j, b2j, ..., b8j)^T and vcj = (c1j, c2j, ..., c8j)^T, where ^T denotes the transpose. The vector vcj can then be calculated as the matrix vector product vcj = A × vbj of the matrix A and the vector vbj.
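 The relation vcj = A × vbj means that the matrix product can be built one column at a time. The following sketch, with hypothetical helper names and a small 2×2 example instead of the 8×8 matrices of FIG. 1, computes each column of C as a matrix vector product and then assembles the columns.

```python
def matvec(a, v):
    """Dense matrix vector product on plain Python lists."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def column(b, j):
    """Column j (0-based) of matrix b as a list, i.e. the vector vbj."""
    return [row[j] for row in b]


def matmul_by_columns(a, b):
    """Compute C = A x B by evaluating vcj = A x vbj for every column j."""
    cols = [matvec(a, column(b, j)) for j in range(len(b[0]))]
    return [list(row) for row in zip(*cols)]   # turn the list of columns back into rows


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_by_columns(A, B)
print(C)
```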
 Here, when using an arithmetic unit that can execute, for example, the matrix vector product of a 4-by-4 matrix and a 4-element vector as one unit of operation, the matrix operation C = A × B is divided into units that the arithmetic unit can compute at one time. In this figure, the submatrices obtained by dividing the matrix A into two in each of the row direction and the column direction are denoted A11, A21, A12, and A22. The submatrix Amn (m = 1, 2; n = 1, 2) consists of the components of the matrix A that lie in the m-th row range of the row-direction division and the n-th column range of the column-direction division. The partial vectors obtained by dividing the vector vbj into two in the row direction are denoted vb1j and vb2j; vbmj (m = 1, 2) consists of the components of the vector vbj in the m-th row range. Likewise, the partial vectors obtained by dividing the vector vcj into two in the row direction are denoted vc1j and vc2j; vcmj (m = 1, 2) consists of the components of the vector vcj in the m-th row range.
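 The partitioning described above can be sketched as follows. The helper names `split_matrix` and `split_vector` are hypothetical, and the matrix entries are encoded as 10·i + j purely so that each printed value shows which component a_ij it stands for.

```python
def split_matrix(a, block):
    """Split a square matrix (list of rows) into block x block submatrices.

    Returns a dict keyed by 1-based (m, n), matching the Amn notation.
    """
    parts = len(a) // block
    return {
        (m + 1, n + 1): [row[n * block:(n + 1) * block]
                         for row in a[m * block:(m + 1) * block]]
        for m in range(parts) for n in range(parts)
    }


def split_vector(v, block):
    """Split a vector into partial vectors keyed by 1-based row-range index m."""
    parts = len(v) // block
    return {m + 1: v[m * block:(m + 1) * block] for m in range(parts)}


A = [[10 * i + j for j in range(1, 9)] for i in range(1, 9)]   # entry 10*i + j stands for a_ij
vb1 = list(range(1, 9))

sub = split_matrix(A, 4)
vb = split_vector(vb1, 4)
print(sub[(2, 1)][0])   # first row of A21, i.e. a51 .. a54
print(vb[2])            # vb21, the lower half of vb1
```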
In this figure, a matrix product is shown as an example of a matrix operation, and the matrix-vector product has been described as forming part of the matrix product. A matrix-vector product that is not part of a matrix product is handled in the same way as, for example, the matrix-vector product vc1 = A × vb1 for the first columns of matrices B and C. The present embodiment illustrates the case in which matrices A, B, and C have a power-of-two number of elements in the row and column directions and matrix A is divided into a power-of-two number of parts in the row and column directions. Alternatively, matrices A, B, and C may have a number of elements other than a power of two in at least one of the row direction and the column direction, and matrix A may be divided into a number of parts other than a power of two in at least one of the row direction and the column direction (for example, 3 × 3, 5 × 5, 9 × 9, 3 × 5, or 5 × 9).
FIG. 2 shows an example of the calculation formulas obtained by decomposing the matrix operation according to the present embodiment into matrix-vector products of submatrices and partial vectors. The matrix-vector product vcj = A × vbj of matrix A and vector vbj can be split into vc1j = (A11 A12) × vbj = A11 × vb1j + A12 × vb2j, which calculates the partial vector vc1j, and vc2j = (A21 A22) × vbj = A21 × vb1j + A22 × vb2j, which calculates the partial vector vc2j. That is, for j = 1, vc11 = A11 × vb11 + A12 × vb21 and vc21 = A21 × vb11 + A22 × vb21. For j = 2, vc12 = A11 × vb12 + A12 × vb22 and vc22 = A21 × vb12 + A22 × vb22. The same applies to j = 3, ..., 8.
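The decomposition into submatrix and partial-vector products can be checked numerically. The following is a minimal Python sketch under assumed data: the helper names `matvec`, `vec_add`, and `block`, and the integer test values, are hypothetical and not part of the disclosure.

```python
# Hypothetical helpers and test data; only the 8x8 / 4x4 sizes come from the text.
def matvec(m, v):
    """Matrix-vector product for lists of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def block(m, bi, bj, size=4):
    """Submatrix A_{bi+1, bj+1}, using zero-based block indices bi, bj."""
    return [row[bj * size:(bj + 1) * size] for row in m[bi * size:(bi + 1) * size]]

A = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(8)]  # arbitrary integers
vb = [c + 1 for c in range(8)]                                   # one column vbj of B
vb1, vb2 = vb[:4], vb[4:]                                        # partial vectors vb1j, vb2j

# vc1j = A11 x vb1j + A12 x vb2j, and vc2j = A21 x vb1j + A22 x vb2j
vc1 = vec_add(matvec(block(A, 0, 0), vb1), matvec(block(A, 0, 1), vb2))
vc2 = vec_add(matvec(block(A, 1, 0), vb1), matvec(block(A, 1, 1), vb2))

print(vc1 + vc2 == matvec(A, vb))
```

Concatenating `vc1` and `vc2` reproduces the undivided product, which is what allows each 4 × 4 product to be issued as a separate unit of operation.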
Thus, when matrix A is divided into d parts in each of the row direction and the column direction and vector vb is divided into d parts, the matrix-vector product of matrix A and vector vb comprises d × d matrix-vector products of a submatrix and a partial vector. If the arithmetic unit has only registers capable of storing a single submatrix, it must perform the matrix operation shown in FIG. 2 while sequentially loading submatrices from memory into the registers, which degrades processing performance.
FIG. 3 shows the configuration of the arithmetic unit 300 according to the present embodiment. The arithmetic unit 300 can execute, by pipeline operation and as one unit of operation, the matrix-vector product of a matrix of up to the numbers of rows and columns set by its specification and a vector of up to the number of rows set by its specification. The arithmetic unit 300 computes the matrix-vector product of a matrix and a vector larger than the size that can be processed in one unit of operation by dividing it into a plurality of matrix-vector products of submatrices and partial vectors, each of which can be processed in one unit of operation.
The arithmetic unit 300 includes a vector storage unit 310, a matrix storage unit 320, a pipeline calculation unit 330, a result storage unit 340, an arithmetic control unit 350, a main memory 360, and a memory control unit 370. The vector storage unit 310 stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector. In the present embodiment, the vector storage unit 310 is, as an example, a register. Alternatively, the vector storage unit 310 may be another storage device, such as a cache memory, that can supply partial vectors to the pipeline calculation unit 330 in a pipelined manner.
Here, the first vector is the target vector to be multiplied by the first matrix, at least one submatrix of which is stored in the matrix storage unit 320. The first vector is larger than the size that the arithmetic unit 300 can process in one unit of operation. The first plurality of partial vectors are obtained by dividing the first vector into sizes that can be processed in one unit of operation. In the matrix operation of FIG. 1, the first vector corresponds to any one of the vectors vbj (for example, vb1). The first plurality of partial vectors correspond to the partial vectors vbij (i = 1, 2) obtained by dividing the first vector vbj. If the first vector is larger, it may be divided into three or more partial vectors.
The vector storage unit 310 may also have a storage area sufficient to further store the second partial vector and other partial vectors among the first plurality of partial vectors. For example, in the matrix operation of FIG. 1, the vector storage unit 310 may store the partial vector vb1j and the partial vector vb2j.
The matrix storage unit 320 stores at least a first submatrix, which is to be multiplied with the first partial vector, among a first plurality of submatrices obtained by dividing the first matrix, which is to be multiplied with the first vector, in the row direction and the column direction. In the present embodiment, the matrix storage unit 320 is, as an example, a register. Alternatively, the matrix storage unit 320 may be another storage device, such as a cache memory, that can supply submatrices to the pipeline calculation unit 330 in a pipelined manner.
Here, the first matrix is the target matrix to be multiplied with the first vector, at least one partial vector of which is stored in the vector storage unit 310. The first matrix is larger than the size that the arithmetic unit 300 can process in one unit of operation. The first plurality of submatrices are obtained by dividing the first matrix into sizes that the arithmetic unit 300 can process in one unit of operation. In the matrix operation of FIG. 1, the first matrix corresponds to matrix A. The first plurality of submatrices correspond to the submatrices Aij (i = 1, 2; j = 1, 2) obtained by dividing the first matrix A in the row direction and the column direction. If the first matrix is larger, it may be divided into three or more parts in each of the row direction and the column direction.
The matrix storage unit 320 may also have a storage area sufficient to further store, among the first plurality of submatrices, a second submatrix to be multiplied with the second partial vector, as well as other submatrices. For example, in the matrix operation of FIG. 1, the matrix storage unit 320 may store the submatrix A11 to be multiplied with the partial vector vb1j and the submatrix A12 to be multiplied with the partial vector vb2j. Here, the first partial vector and the second partial vector are located in different row ranges of the first vector. The first submatrix and the second submatrix are therefore located in different column ranges of the target matrix. The first submatrix and the second submatrix may be located in the same row range of the target matrix.
The pipeline calculation unit 330 is connected to the vector storage unit 310 and the matrix storage unit 320. It receives the target partial vector stored in the vector storage unit 310 from the vector storage unit 310, and receives the target submatrix stored in the matrix storage unit 320 from the matrix storage unit 320. By pipeline operation, the pipeline calculation unit 330 can execute an operation that adds an intermediate vector to the matrix-vector product of the target submatrix and the target partial vector. In the present embodiment, the pipeline calculation unit 330 can execute, as one unit of operation, an operation that calculates the matrix-vector product of a 4-row, 4-column submatrix and a 4-row partial vector and adds a 4-row intermediate vector to this product to obtain the resulting partial vector (also referred to as a "result vector").
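The unit of operation described above, a 4 × 4 matrix-vector product plus an intermediate vector, can be sketched in Python as follows. This only models the arithmetic, not the hardware; the function name `unit_op` and the sample values are assumptions made for illustration.

```python
def unit_op(sub, vec, mid):
    """Hypothetical model of one unit of operation: sub x vec + mid.

    sub is a 4x4 submatrix (list of rows), vec a 4-row partial vector,
    and mid a 4-row intermediate vector.
    """
    return [sum(a * x for a, x in zip(row, vec)) + t for row, t in zip(sub, mid)]

# Sample values chosen so the result is easy to check by hand.
sub = [[2, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 2, 0],
       [0, 0, 0, 2]]
vec = [1, 2, 3, 4]
mid = [10, 10, 10, 10]

result = unit_op(sub, vec, mid)
print(result)  # [12, 14, 16, 18]
```

Passing an all-zero `mid` gives the plain matrix-vector product, which corresponds to the first product of each accumulation chain.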
Here, being executable as one unit of operation means that, in response to a request such as an external instruction or the execution of an instruction, the pipeline calculation unit 330 performs the whole operation of adding the intermediate vector to the matrix-vector product of the target submatrix and partial vector, and outputs the result. The pipeline calculation unit 330 may have a large number of arithmetic elements so that every basic operation included in this operation (for example, each multiplication and addition of values) is performed by a separate element, or it may instead perform some of those operations on the same element.
That the pipeline calculation unit 330 performs a pipeline operation means that it outputs a result after processing in a plurality of stages following the start of an operation, and that the stages can operate in parallel. That is, in each cycle between the start of one operation and the output of its result, the pipeline calculation unit 330 can start further operations one after another unless there is some obstacle to execution.
For example, the pipeline calculation unit 330 may input a submatrix and a partial vector in the first cycle, multiply corresponding elements of the submatrix and the partial vector in the second cycle, sum, in the third cycle, the products computed in the second cycle for each element to be included in the result vector, and output the resulting partial vector in the fourth cycle. The pipeline calculation unit 330 may adopt a pipeline structure with any number of stages as needed.
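The overlap of stages in the four-cycle example can be visualized with a small Python sketch. It assumes one independent operation is issued per cycle and that every stage takes exactly one cycle; the stage labels and the function `occupancy` are hypothetical names.

```python
# Stage names follow the four-cycle example; the labels themselves are assumed.
STAGES = ["input", "multiply", "sum", "output"]

def occupancy(num_ops):
    """Map each cycle to {stage: operation index}, issuing one operation per cycle."""
    table = {}
    for op in range(num_ops):
        for depth, stage in enumerate(STAGES):
            table.setdefault(op + depth, {})[stage] = op
    return table

table = occupancy(3)
for cycle in sorted(table):
    print(cycle, table[cycle])
```

In cycle 2 of this model, operation 0 is being summed while operation 1 is being multiplied and operation 2 is being input, which is the parallel stage activity the passage describes.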
The result storage unit 340 is connected to the pipeline calculation unit 330. The result storage unit 340 receives and stores the result vectors output by the pipeline calculation unit 330. The result vectors are, for example, vc11 and vc21 in FIG. 2. In the present embodiment, the result storage unit 340 is, as an example, a register. Alternatively, the result storage unit 340 may be another storage device, such as a cache memory, that can store partial vectors from the pipeline calculation unit 330 in a pipelined manner. The vector storage unit 310, the matrix storage unit 320, and the result storage unit 340 may be implemented as the same storage device.
The arithmetic control unit 350 is connected to the vector storage unit 310, the matrix storage unit 320, the pipeline calculation unit 330, and the result storage unit 340. In response to a request to execute a matrix operation, such as receiving an instruction from outside the arithmetic unit 300 or decoding a matrix operation instruction during program execution in the arithmetic unit 300, the arithmetic control unit 350 controls the vector storage unit 310, the matrix storage unit 320, the pipeline calculation unit 330, and the result storage unit 340 so as to execute the requested matrix operation.
The main memory 360 stores the matrices subject to the matrix operation and the operation results. The memory control unit 370 is connected between the main memory 360 and the vector storage unit 310, the matrix storage unit 320, and the result storage unit 340. In response to an external instruction or a memory access instruction during program execution in the arithmetic unit 300, the memory control unit 370 transfers data between the main memory 360 and the vector storage unit 310, the matrix storage unit 320, and the result storage unit 340.
For example, in response to a request for a vector load from the main memory 360 to the vector storage unit 310, the memory control unit 370 reads from the main memory 360 the partial vector designated by the vector load, among the partial vectors stored in the main memory 360, and stores it in the vector storage unit 310. In response to a request for a matrix load from the main memory 360 to the matrix storage unit 320, the memory control unit 370 reads from the main memory 360 the submatrix designated by the matrix load, among the submatrices stored in the main memory 360, and stores it in the matrix storage unit 320. In response to a request for a matrix or vector store from the result storage unit 340 to the main memory 360, the memory control unit 370 reads the matrix or vector of the operation result stored in the result storage unit 340 and stores it in the main memory 360. Depending on the design of the arithmetic unit 300, instead of providing the main memory 360 in addition to the vector storage unit 310 and the matrix storage unit 320, a relatively large memory that functions as the vector storage unit 310 and the matrix storage unit 320 may be provided so that partial vectors and submatrices can be supplied from that memory directly to the pipeline calculation unit 330 in a pipelined manner.
With the configuration described above, the pipeline calculation unit 330 executes matrix-vector product operations by pipeline processing. For example, when performing the operation vc11 = A11 × vb11 + A12 × vb21 shown in FIG. 2, the pipeline calculation unit 330 requires a plurality of cycles from the start of the operation on the first submatrix A11 and the first partial vector vb11 until the result is obtained. Consequently, suppose that a first operation calculates the matrix-vector product of the first submatrix A11 and the first partial vector vb11, and that a second operation, which adds the matrix-vector product of the second submatrix A12 and the second partial vector vb21 to the result of the first operation, is issued in the cycle immediately after the first operation was started. Execution of the second operation is then obstructed (a pipeline hazard), and the second operation must be made to wait until the result of the first operation becomes available.
Depending on the pipeline design, the wait for the second operation can be reduced to some extent, for example by supplying the result of the first operation to the second operation without waiting for it to be written to a register (bypassing or forwarding). However, as long as there is a dependency between the first operation and the second operation, it is difficult to eliminate completely the gaps that pipeline hazards create in the pipeline of the pipeline calculation unit 330.
Therefore, while the pipeline calculation unit 330 is performing the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector (for example, A11 × vb11), the arithmetic control unit 350 instructs the pipeline calculation unit 330 to execute another matrix-vector product operation that uses the first partial vector or the first submatrix. Here, the "other matrix-vector product" is an operation that does not use the result of the matrix-vector product of the first submatrix and the first partial vector. It may be an operation that includes a matrix-vector product, for example one that adds to a matrix-vector product an operation result other than that of the first submatrix and the first partial vector. In this way, during the period in which the pipeline calculation unit 330 would otherwise wait for the result of the first operation before starting the second operation, the arithmetic control unit 350 can issue to the pipeline calculation unit 330 one or more other matrix-vector products that do not depend on the result of the first operation, thereby increasing the utilization efficiency of the pipeline calculation unit 330.
Furthermore, the arithmetic control unit 350 may instruct the pipeline calculation unit 330 to execute, in or after the cycle in which the result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, the operation that adds the matrix-vector product of the second submatrix and the second partial vector to that result. In this way, the arithmetic control unit 350 can prevent a pipeline hazard from occurring in the second operation and can issue other matrix-vector product operations between the first operation and the second operation.
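The effect of inserting independent operations between a first operation and the second operation that depends on it can be estimated with a simple in-order issue model. The Python sketch below is only an illustration: the function `issue_cycles` is hypothetical, and the latency of two cycles (a result is usable two cycles after its operation is issued) is an assumed value, not one stated in the text.

```python
def issue_cycles(deps, latency):
    """Issue cycle of each operation under in-order, one-per-cycle issue.

    deps[i] is the index of the earlier operation whose result operation i
    needs, or None if it is independent. An operation waits until its
    producer's result is usable, i.e. `latency` cycles after the producer issued.
    """
    issued = []
    for dep in deps:
        earliest = issued[-1] + 1 if issued else 0
        if dep is not None:
            earliest = max(earliest, issued[dep] + latency)
        issued.append(earliest)
    return issued

# Naive order: A11*vb11, A12*vb21 + tmp1, A11*vb12, A12*vb22 + tmp2
naive = issue_cycles([None, 0, None, 2], latency=2)
# Interleaved order: A11*vb11, A11*vb12, A12*vb21 + tmp1, A12*vb22 + tmp2
interleaved = issue_cycles([None, None, 0, 1], latency=2)

print(naive)        # [0, 2, 3, 5]
print(interleaved)  # [0, 1, 2, 3]
```

Under these assumptions the naive order leaves two empty issue slots, while the interleaved order issues one operation in every cycle.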
FIG. 4 shows a first example of pipeline processing by the arithmetic unit 300 according to the present embodiment. In the operation labeled cycle 0, the arithmetic control unit 350 instructs the vector storage unit 310 to read out vb11, an example of the first partial vector, and instructs the matrix storage unit 320 to read out A11, an example of the first submatrix. It also instructs the pipeline calculation unit 330 to execute an operation that calculates the matrix-vector product of the first submatrix A11 and the first partial vector vb11 and stores it, as an intermediate vector of the ongoing calculation, in an intermediate register (temporary register) vctmp1 of the pipeline calculation unit 330.
By the start of execution of cycle 1, the vector storage unit 310 further stores, in addition to vb11 (an example of the first partial vector) and vb21 (an example of the second partial vector), a third partial vector (vb12 as an example) to be multiplied by the first submatrix A11, among a second plurality of partial vectors vbi2 obtained by dividing a second vector (vb2 as an example) to be multiplied by the first matrix A. In this example, the first vector and the second vector are column vectors included in a second matrix B to be multiplied with the first matrix A; for example, the first vector is vb1 and the second vector is vb2. The third partial vector is vb12, which, among the second plurality of partial vectors vbi2 obtained by dividing the second vector vb2, is the one to be multiplied by the first submatrix A11. Alternatively, the first vector and the second vector may be separate vectors, each to be multiplied by matrix A.
In the operation labeled cycle 1, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the arithmetic control unit 350 instructs the vector storage unit 310 to read out the third partial vector vb12 and instructs the matrix storage unit 320 to read out the first submatrix A11. It also instructs the pipeline calculation unit 330 to execute the matrix-vector product of the first submatrix and the third partial vector as another matrix-vector product operation that does not cause a pipeline hazard. In response, the pipeline calculation unit 330 executes an operation that stores the matrix-vector product of the first submatrix and the third partial vector, as an intermediate vector of the ongoing calculation, in an intermediate register vctmp2 of the pipeline calculation unit 330. This operation corresponds to the first matrix-vector product in the third line of FIG. 2, and the matrix-vector products of cycles 0 and 1 contribute to different result vectors, vc11 and vc12. Since there is therefore no dependency between these operations, the pipeline calculation unit 330 can execute them without causing a pipeline hazard.
In the operation labeled cycle 2, the arithmetic control unit 350 instructs the vector storage unit 310 to read out vb21, an example of the second partial vector, and instructs the matrix storage unit 320 to read out A12, an example of the second submatrix. It also instructs the pipeline calculation unit 330 to execute an operation that calculates the matrix-vector product of the second submatrix A12 and the second partial vector vb21 and adds to it the result vctmp1 of the operation of cycle 0, and instructs the result storage unit 340 to store the partial vector vc11 obtained as the result. The operation of cycle 2 depends on the operation of cycle 0. By inserting the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2, the arithmetic control unit 350 can increase the utilization efficiency of the pipeline of the pipeline calculation unit 330.
By the start of execution of cycle 3, the vector storage unit 310 may further store a fourth partial vector to be multiplied by the second submatrix A12, among the second plurality of partial vectors. In the operation labeled cycle 3, the arithmetic control unit 350 instructs the vector storage unit 310 to read out vb22, an example of the fourth partial vector, and instructs the matrix storage unit 320 to read out A12, an example of the second submatrix. It also instructs the pipeline calculation unit 330 to execute an operation that calculates the matrix-vector product of the second submatrix A12 and the fourth partial vector vb22 and adds to it the result vctmp2 of the operation of cycle 1, and instructs the main memory 360 to store the partial vector vc12 obtained as the result. The operation of cycle 3 depends on the operation of cycle 1. By inserting the operation of cycle 2, which does not depend on the operation of cycle 1, between the operations of cycle 1 and cycle 3, the arithmetic control unit 350 can increase the utilization efficiency of the pipeline of the pipeline calculation unit 330.
In the example of this figure, the partial vectors (vc11, vc12) in the first row range (rows 1 to 4) of a plurality of column vectors (vc1, vc2) of matrix C are calculated in cycles 0 to 3, and the partial vectors (vc21, vc22) in the second row range (rows 5 to 8) of those column vectors are calculated in cycles 4 to 7. The operations of cycles 4 to 7 are the same as those of cycles 0 to 3, except that the submatrices A21 and A22 are used in place of A11 and A12 and the partial vectors vc21 and vc22 are obtained in place of vc11 and vc12, so their description is omitted.
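The eight-cycle issue order of this first example can be replayed in software to confirm that it yields the same partial vectors as the undivided product. The Python sketch below makes several assumptions: the tuple encoding `(m, n, j)` with zero-based indices, the dictionary `acc` standing in for the intermediate registers and result storage, and the integer test data are all hypothetical.

```python
def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def block(m, bi, bj, size=4):
    """Submatrix with zero-based block indices (bi, bj)."""
    return [row[bj * size:(bj + 1) * size] for row in m[bi * size:(bi + 1) * size]]

A = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(8)]
cols = [[(r + 2 * j) % 5 for r in range(8)] for j in range(2)]  # stand-ins for vb1, vb2

# (m, n, j): submatrix A_{m+1,n+1} times partial vector vb_{n+1,j+1}, cycles 0..7
schedule = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
            (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]

acc = {}  # (m, j) -> running partial vector vc_{m+1,j+1}
for m, n, j in schedule:
    part = matvec(block(A, m, n), cols[j][4 * n:4 * n + 4])
    prev = acc.get((m, j), [0] * 4)
    acc[(m, j)] = [p + q for p, q in zip(prev, part)]

C_cols = [acc[(0, j)] + acc[(1, j)] for j in range(2)]
print(C_cols == [matvec(A, col) for col in cols])
```

In this order, no two consecutive operations update the same accumulator `(m, j)`, which is the property that leaves one cycle between a product and the accumulation that depends on it.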
In this example, the arithmetic control unit 350 inserts, between the first operation (the matrix-vector product of the first submatrix and the first partial vector) and the second operation that uses its result, another matrix-vector product operation using the first submatrix, namely, in this example, the matrix-vector product of the first submatrix and the third partial vector. The arithmetic control unit 350 can thereby make use of the one empty cycle that is required between the first operation and the second operation.
When a plurality of empty cycles arise between the first operation and the second operation, the arithmetic control unit 350 may insert the matrix-vector products of the first submatrix and each of a plurality of third partial vectors between the first operation and the second operation. For example, the vector storage unit 310 further stores a plurality of second vectors vb2, vb3, ... included in the second matrix B. The arithmetic control unit 350 fills each cycle, from after the start of the pipeline operation of the matrix-vector product of the first submatrix A11 and the first partial vector vb11 until just before its result becomes available without delay, with the operations A11 × vb12, A11 × vb13, ..., that is, the matrix-vector products of the first submatrix A11 and the third partial vectors vb12, vb13, ... taken from the respective second vectors vb2, vb3, .... The first vector and the plurality of second vectors may be arranged in the column order of the second matrix or in the reverse of that order, or they need not follow the column order and may each be the column vector of an arbitrary column.
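How many cycles separate a product from the accumulation that depends on it, as a function of the number of column vectors swept per submatrix, can be checked with a short Python sketch. The functions `column_interleaved` and `min_gap` and the `(m, n, j)` encoding are hypothetical names introduced for illustration only.

```python
def column_interleaved(num_cols, num_blocks=2):
    """Issue order (m, n, j): for each submatrix A_mn, sweep all columns j."""
    return [(m, n, j)
            for m in range(num_blocks)
            for n in range(num_blocks)
            for j in range(num_cols)]

def min_gap(order):
    """Smallest issue distance between two operations updating the same vc_mj."""
    last, gap = {}, None
    for t, (m, n, j) in enumerate(order):
        if (m, j) in last:
            d = t - last[(m, j)]
            gap = d if gap is None else min(gap, d)
        last[(m, j)] = t
    return gap

for k in (1, 2, 4):
    print(k, min_gap(column_interleaved(k)))
```

In this model the gap equals the number of columns swept, so sweeping at least as many columns as the pipeline needs cycles to make a result available would keep dependent operations from stalling.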
FIG. 5 shows a second example of pipeline processing by the arithmetic unit 300 according to the present embodiment. When the pipeline calculation unit 330 has more intermediate registers, or when an operation result becomes available only after it has first been stored in the main memory 360, the arithmetic control unit 350 may control the operations of cycles 4 to 5 in FIG. 4 so that they are performed before the operations of cycles 2 to 3. In this case, during the pipeline operation of the first operation, which computes the matrix-vector product of the first submatrix A11 and the first partial vector vb11, the arithmetic control unit 350 causes the pipeline calculation unit 330 to execute the operation of cycle 1 in FIG. 5, which is another matrix-vector product using the first submatrix, and the operation of cycle 2 in FIG. 5, which is another matrix-vector product using the first partial vector. The arithmetic control unit 350 may also cause the operation of cycle 3, which is the matrix-vector product of the second submatrix A21 used in the operation of cycle 2 and the second partial vector vb12 used in the operation of cycle 1, to be executed between the first operation and the second operation. The arithmetic control unit 350 can thereby fill further idle cycles between the first operation and the second operation. The operations of cycles 0 to 3 may be executed in any order relative to one another, and the order of the operations of cycles 4 to 7 may be determined according to the order of the corresponding operations in cycles 0 to 3.
FIG. 6 shows a third example of pipeline processing by the arithmetic unit 300 according to the present embodiment. In the operation shown as cycle 0, the arithmetic control unit 350 performs the same control as in cycle 0 of FIG. 4.
By the start of execution of cycle 1, the matrix storage unit 320 further stores, in addition to A11 (an example of the first submatrix) and A12 (an example of the second submatrix), a third submatrix (A21 as an example) that is to be multiplied by the first partial vector vb11, from among the first plurality of submatrices Aij obtained by dividing the first matrix A in the row direction and the column direction. Alternatively, the third submatrix may be a submatrix included in a matrix other than the first matrix A.
In the operation shown as cycle 1, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, the arithmetic control unit 350 instructs the vector storage unit 310 to read the first partial vector vb11, instructs the matrix storage unit 320 to read the third submatrix A21, and instructs the pipeline calculation unit 330 to execute the matrix-vector product of the third submatrix A21 and the first partial vector vb11 as another matrix-vector product that does not cause a pipeline hazard. In response, the pipeline calculation unit 330 computes the matrix-vector product of the third submatrix A21 and the first partial vector vb11 and stores it, as an intermediate vector of the ongoing calculation, in the intermediate register vctmp2 of the pipeline calculation unit 330. This operation corresponds to the first matrix-vector product in the second row of FIG. 2, and the matrix-vector products of cycles 0 and 1 contribute to different result vectors, vc11 and vc21. Because there is no dependency between these operations, the pipeline calculation unit 330 can execute them without causing a pipeline hazard.
In the operation shown as cycle 2, the arithmetic control unit 350 performs the same control as in cycle 2 of FIG. 4. The operation of cycle 2 depends on the operation of cycle 0; by inserting the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2, the arithmetic control unit 350 can raise the utilization efficiency of the pipeline of the pipeline calculation unit 330.
By the start of execution of cycle 3, the matrix storage unit 320 may further store a fourth submatrix that is to be multiplied by the second partial vector vb21, from among the first plurality of submatrices Aij. In the operation shown as cycle 3, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb21 (an example of the second partial vector), instructs the matrix storage unit 320 to read A22 (an example of the fourth submatrix), instructs the pipeline calculation unit 330 to compute the matrix-vector product of the fourth submatrix A22 and the second partial vector vb21 and add to it the result vctmp2 of the operation of cycle 1, and instructs the main memory 360 to store the resulting partial vector vc21. The operation of cycle 3 depends on the operation of cycle 1; by inserting the operation of cycle 2, which does not depend on the operation of cycle 1, between the operations of cycle 1 and cycle 3, the arithmetic control unit 350 can raise the utilization efficiency of the pipeline of the pipeline calculation unit 330.
In the example of this figure, the two partial vectors vc11 and vc21 contained in one column vector vc1 of the matrix C are calculated in cycles 0 to 3, and the two partial vectors vc12 and vc22 contained in another column vector vc2 of the matrix C are calculated in cycles 4 to 7. The operations of cycles 4 to 7 are the same as those of cycles 0 to 3 except that the partial vectors vb12 and vb22 are used in place of vb11 and vb21 and the partial vectors vc12 and vc22 in place of vc11 and vc21, so their description is omitted.
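The cycle order of this third example can be sketched in Python as follows. The sketch is a functional model only, written under assumptions: the function name `third_example` and the nested-list layout (`A[i][j]` for the submatrices, `vb[j][k]` for the partial vectors) are hypothetical, and the pipelining itself is not simulated; only the order of the operations and their operands follow the description.

```python
def matvec(m, v):
    """Return the matrix-vector product m x v for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def add(u, v):
    """Return the element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def third_example(A, vb):
    """Compute vc[i][k] for a 2 x 2 block partition in the FIG. 6 order.

    A[i][j] is the submatrix A(i+1)(j+1); vb[j][k] is the partial vector
    vb(j+1)(k+1); the result vc[i][k] is the partial vector vc(i+1)(k+1).
    """
    vc = [[None, None], [None, None]]
    for k in range(2):  # k = 0: cycles 0 to 3, k = 1: cycles 4 to 7
        vctmp1 = matvec(A[0][0], vb[0][k])  # cycle 0 / 4: A11 x vb1k
        vctmp2 = matvec(A[1][0], vb[0][k])  # cycle 1 / 5: A21 x vb1k, independent of vctmp1
        vc[0][k] = add(matvec(A[0][1], vb[1][k]), vctmp1)  # cycle 2 / 6: A12 x vb2k + vctmp1
        vc[1][k] = add(matvec(A[1][1], vb[1][k]), vctmp2)  # cycle 3 / 7: A22 x vb2k + vctmp2
    return vc


# 1 x 1 blocks, so the result can be compared with an ordinary 2 x 2 matrix product.
A = [[[[1]], [[2]]], [[[3]], [[4]]]]
vb = [[[5], [6]], [[7], [8]]]
print(third_example(A, vb))
```

Within each group of four cycles the same partial vector is read in two consecutive cycles, while the submatrix changes every cycle.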
In this example, the arithmetic control unit 350 inserts, between the first operation (the matrix-vector product of the first submatrix and the first partial vector) and the second operation that uses its result, another matrix-vector product that uses the first partial vector; in this example, the matrix-vector product of the third submatrix and the first partial vector. The arithmetic control unit 350 can thereby put to use one of the idle cycles that would otherwise be required between the first operation and the second operation.
When a plurality of idle cycles arise between the first operation and the second operation, the arithmetic control unit 350 may insert, between the first operation and the second operation, the matrix-vector products of each of a plurality of third submatrices and the first partial vector. For example, the matrix storage unit 320 stores a plurality of third submatrices A21, A31, ... that are included in the first matrix A and are to be multiplied by the first partial vector. The arithmetic control unit 350 fills each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix A11 and the first partial vector vb11 until just before its result becomes available without delay, with the matrix-vector products of each of the third submatrices A21, A31, ... and the first partial vector vb11. The first submatrix and the plurality of third submatrices may be arranged in column order, or in the reverse of column order, within the same row range of the first matrix; alternatively, they need not follow the column order of the second matrix, and each may be a submatrix of an arbitrary column range.
FIG. 7 shows a fourth example of pipeline processing by the arithmetic unit 300 according to the present embodiment. When the pipeline calculation unit 330 has more intermediate registers, or when an operation result becomes available only after it has first been stored in the main memory 360, the arithmetic control unit 350 may control the operations of cycles 4 to 5 in FIG. 6 so that they are performed before the operations of cycles 2 to 3. In this case, during the pipeline operation of the first operation, which computes the matrix-vector product of the first submatrix A11 and the first partial vector vb11, the arithmetic control unit 350 causes the pipeline calculation unit 330 to execute the operation of cycle 1 in FIG. 7, which is another matrix-vector product using the first partial vector, and the operation of cycle 2 in FIG. 7, which is another matrix-vector product using the first submatrix. The arithmetic control unit 350 may also cause the operation of cycle 3, which is the matrix-vector product of the third submatrix A21 used in the operation of cycle 1 and the second partial vector vb12 used in the operation of cycle 2, to be executed between the first operation and the second operation. The arithmetic control unit 350 can thereby fill further idle cycles between the first operation and the second operation. The operations of cycles 0 to 3 may be executed in any order relative to one another, and the order of the operations of cycles 4 to 7 may be determined according to the order of the corresponding operations in cycles 0 to 3. The pipeline processing of FIG. 7 is substantially the same as that of FIG. 5 with the operations of cycles 1 and 2 exchanged and the operations of cycles 5 and 6 exchanged.
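How the order of operations affects idle cycles can be illustrated with a small Python model. Everything in it is an assumption made for illustration: the function `count_stalls`, the in-order issue of one operation per cycle, and the single `latency` parameter (the number of cycles after issue until a result can be consumed without delay) are hypothetical and are not taken from the specification. The result names in cycles 4 to 7 of the FIG. 5 order are likewise inferred from the correspondence with cycles 0 to 3.

```python
def count_stalls(schedule, latency):
    """Issue operations in order and return the number of idle cycles inserted.

    schedule: list of (result_name, needed_name_or_None) pairs.
    latency:  cycles after issue until a result can be consumed without delay
              (a simplifying assumption of this sketch).
    """
    ready = {}   # result name -> first cycle at which it can be consumed
    cycle = 0
    stalls = 0
    for result, needed in schedule:
        if needed is not None and ready[needed] > cycle:
            stalls += ready[needed] - cycle   # wait for the intermediate vector
            cycle = ready[needed]
        ready[result] = cycle + latency
        cycle += 1
    return stalls


# Each product immediately followed by the operation that consumes it.
NAIVE = [("t1", None), ("vc11", "t1"), ("t2", None), ("vc12", "t2")]
# Cycles 0 to 3 of the first example: one independent product inserted.
FIG4_ORDER = [("t1", None), ("t2", None), ("vc11", "t1"), ("vc12", "t2")]
# Second example: four independent products before the first dependent one.
FIG5_ORDER = [("t1", None), ("t2", None), ("t3", None), ("t4", None),
              ("vc11", "t1"), ("vc12", "t2"), ("vc21", "t3"), ("vc22", "t4")]

print(count_stalls(NAIVE, 2), count_stalls(FIG4_ORDER, 2))
print(count_stalls(FIG4_ORDER, 4), count_stalls(FIG5_ORDER, 4))
```

Under this model, inserting one independent product hides a latency of two cycles, while a latency of four cycles needs the deeper reordering of the second or fourth example, at the cost of holding four intermediate vectors at once.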
In any pipeline processing, including the first to fourth examples above, the arithmetic control unit 350 may instruct the memory control unit 370 to transfer the partial vectors and submatrices used by the pipeline calculation unit 330 from the main memory 360 to the vector storage unit 310 and the matrix storage unit 320 before the pipeline calculation unit 330 needs them. For example, in the example of FIG. 4, the main memory 360 may transfer the partial vectors vb11, vb12, vb21, and vb22 to the vector storage unit 310 and the submatrices A11 and A12 to the matrix storage unit 320 before cycle 0. Alternatively, the main memory 360 may transfer the partial vectors vb11 and vb12 to the vector storage unit 310 and the submatrix A11 to the matrix storage unit 320 before cycle 0, and then transfer the partial vectors vb21 and vb22 to the vector storage unit 310 and the submatrix A12 to the matrix storage unit 320 before cycle 2.
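The requirement that operands be transferred before they are needed can be expressed as a per-operand deadline. The following Python sketch derives, for cycles 0 to 3 of the first example, the first cycle in which each operand is read; the function name `transfer_deadlines` and the tuple representation of a cycle are hypothetical choices made for this illustration.

```python
def transfer_deadlines(schedule):
    """Map each operand to the first cycle that reads it.

    schedule: list of (submatrix_name, partial_vector_name), one per cycle.
    An operand must be present in its storage unit before the returned cycle.
    """
    deadline = {}
    for cycle, (matrix, vector) in enumerate(schedule):
        deadline.setdefault(matrix, cycle)   # keep only the earliest use
        deadline.setdefault(vector, cycle)
    return deadline


# Operands of cycles 0 to 3 of the first example, as described in the text.
FIG4_CYCLES_0_TO_3 = [("A11", "vb11"), ("A11", "vb12"),
                      ("A12", "vb21"), ("A12", "vb22")]

print(transfer_deadlines(FIG4_CYCLES_0_TO_3))
```

Both transfer plans described above meet these deadlines: transferring everything before cycle 0 trivially does, and the two-stage plan delivers A12, vb21, and vb22 before cycle 2, which is the earliest cycle that reads any of them.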
In the pipeline processing of the first and second examples, the pipeline calculation unit 330 uses a different partial vector (vb11, vb12, vb21, vb22) in each cycle, but uses a new submatrix (A11, A12, A21, A22) only once every two cycles. The matrix storage unit 320 therefore only needs enough throughput to output one submatrix every two cycles, which makes it possible to reduce the power consumption and circuit scale of the matrix storage unit 320.
In the pipeline processing of the third and fourth examples, the pipeline calculation unit 330 uses a different submatrix (A11, A12, A21, A22) in each cycle, but uses a new partial vector (vb11, vb12, vb21, vb22) only once every two cycles. The vector storage unit 310 therefore only needs enough throughput to output one partial vector every two cycles, which makes it possible to reduce the power consumption and circuit scale of the vector storage unit 310.
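The difference in read traffic between the two families of examples can be checked with a short Python sketch. The operand sequences below are reconstructed from the description of the first and third examples (cycles 6 to 7 of the first example are inferred by analogy), and the function name `operand_changes` is hypothetical.

```python
def operand_changes(schedule):
    """Count how often each operand differs from the one used in the previous cycle.

    schedule: list of (submatrix_name, partial_vector_name), one per cycle.
    Returns (submatrix_changes, partial_vector_changes).
    """
    matrix_changes = 0
    vector_changes = 0
    for prev, cur in zip(schedule, schedule[1:]):
        matrix_changes += prev[0] != cur[0]   # True counts as 1
        vector_changes += prev[1] != cur[1]
    return matrix_changes, vector_changes


# First example: the submatrix is held for two consecutive cycles.
FIRST_EXAMPLE = [("A11", "vb11"), ("A11", "vb12"), ("A12", "vb21"), ("A12", "vb22"),
                 ("A21", "vb11"), ("A21", "vb12"), ("A22", "vb21"), ("A22", "vb22")]
# Third example: the partial vector is held for two consecutive cycles.
THIRD_EXAMPLE = [("A11", "vb11"), ("A21", "vb11"), ("A12", "vb21"), ("A22", "vb21"),
                 ("A11", "vb12"), ("A21", "vb12"), ("A12", "vb22"), ("A22", "vb22")]

print(operand_changes(FIRST_EXAMPLE))
print(operand_changes(THIRD_EXAMPLE))
```

Over eight cycles the first example changes the submatrix three times and the partial vector seven times, and the third example does the opposite. Since a submatrix holds more elements than a partial vector, which operand is held longer affects the bandwidth each storage unit must provide.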
A designer of the arithmetic unit 300, or a user of the arithmetic unit 300, may select the execution order of the pipeline processing so that the circuit scale of the arithmetic unit 300 can be made smaller, or so that its power consumption can be made smaller.
Various embodiments of the present invention may be described with reference to flowcharts and block diagrams, in which a block may represent (1) a stage of a process in which an operation is performed or (2) a section of a device responsible for performing the operation. Specific stages and sections may be implemented by dedicated circuits, by programmable circuits supplied with computer-readable instructions stored on a computer-readable medium, or by processors supplied with computer-readable instructions stored on a computer-readable medium. Dedicated circuits may include digital or analog hardware circuits, and may include integrated circuits (ICs) or discrete circuits. Programmable circuits may include reconfigurable hardware circuits that include logical AND, logical OR, logical XOR, logical NAND, logical NOR, and other logical operations, as well as memory elements such as flip-flops, registers, field-programmable gate arrays (FPGAs), and programmable logic arrays (PLAs).
 コンピュータ可読媒体は、適切なデバイスによって実行される命令を格納可能な任意の有形なデバイスを含んでよく、その結果、そこに格納される命令を有するコンピュータ可読媒体は、フローチャートまたはブロック図で指定された操作を実行するための手段を作成すべく実行され得る命令を含む、製品を備えることになる。コンピュータ可読媒体の例としては、電子記憶媒体、磁気記憶媒体、光記憶媒体、電磁記憶媒体、半導体記憶媒体等が含まれてよい。コンピュータ可読媒体のより具体的な例としては、フロッピー(登録商標)ディスク、ディスケット、ハードディスク、ランダムアクセスメモリ(RAM)、リードオンリメモリ(ROM)、消去可能プログラマブルリードオンリメモリ(EPROMまたはフラッシュメモリ)、電気的消去可能プログラマブルリードオンリメモリ(EEPROM)、静的ランダムアクセスメモリ(SRAM)、コンパクトディスクリードオンリメモリ(CD-ROM)、デジタル多用途ディスク(DVD)、ブルーレイ(RTM)ディスク、メモリスティック、集積回路カード等が含まれてよい。 The computer-readable medium may include any tangible device capable of storing instructions executed by the appropriate device, so that the computer-readable medium having the instructions stored therein is specified in a flowchart or block diagram. It will be equipped with a product that contains instructions that can be executed to create means for performing the operation. Examples of computer-readable media may include electronic storage media, magnetic storage media, optical storage media, electromagnetic storage media, semiconductor storage media, and the like. More specific examples of computer-readable media include floppy (registered trademark) disks, diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), Electrically erasable programmable read-only memory (EEPROM), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile disc (DVD), Blu-ray (RTM) disc, memory stick, integrated A circuit card or the like may be included.
Computer-readable instructions may include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, JAVA (registered trademark), and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
Computer-readable instructions may be provided to a processor or programmable circuit of a general-purpose computer, a special-purpose computer, or another programmable data processing device, either locally or via a local area network (LAN) or a wide area network (WAN) such as the Internet, and the computer-readable instructions may be executed to create means for performing the operations specified in the flowcharts or block diagrams. Examples of processors include computer processors, processing units, microprocessors, digital signal processors, controllers, and microcontrollers.
 図8は、本発明の複数の態様が全体的または部分的に具現化されてよいコンピュータ2200の例を示す。コンピュータ2200にインストールされたプログラムは、コンピュータ2200に、本発明の実施形態に係る装置に関連付けられる操作または当該装置の1または複数のセクションとして機能させることができてもよいし、または当該操作または当該1または複数のセクションを実行させることができてもよいし、コンピュータ2200に、本発明の実施形態に係るプロセスまたは当該プロセスの段階を実行させることができてもよい。そのようなプログラムは、コンピュータ2200に、本明細書に記載のフローチャートおよびブロック図のブロックのうちのいくつかまたはすべてに関連付けられた特定の操作を実行させるべく、CPU2212によって実行されてよい。 FIG. 8 shows an example of a computer 2200 in which a plurality of aspects of the present invention may be embodied in whole or in part. The program installed on the computer 2200 may allow the computer 2200 to function as an operation associated with the device according to an embodiment of the present invention or as one or more sections of the device, or the operation or the device. It may be possible to have one or more sections run, or the computer 2200 may be able to run a process according to an embodiment of the invention or a stage of the process. Such a program may be run by the CPU 2212 to cause the computer 2200 to perform certain operations associated with some or all of the blocks in the flowcharts and block diagrams described herein.
 本実施形態によるコンピュータ2200は、CPU2212、RAM2214、グラフィックコントローラ2216、およびディスプレイデバイス2218を含み、それらはホストコントローラ2210によって相互に接続されている。コンピュータ2200はまた、通信インターフェイス2222、ハードディスクドライブ2224、DVD-ROMドライブ2226、およびICカードドライブのような入出力ユニットを含み、それらは入出力コントローラ2220を介してホストコントローラ2210に接続されている。コンピュータはまた、ROM2230およびキーボード2242のようなレガシの入出力ユニットを含み、それらは入出力チップ2240を介して入出力コントローラ2220に接続されている。 The computer 2200 according to this embodiment includes a CPU 2212, a RAM 2214, a graphic controller 2216, and a display device 2218, which are connected to each other by a host controller 2210. The computer 2200 also includes input / output units such as a communication interface 2222, a hard disk drive 2224, a DVD-ROM drive 2226, and an IC card drive, which are connected to the host controller 2210 via an input / output controller 2220. The computer also includes legacy I / O units such as the ROM 2230 and keyboard 2242, which are connected to the I / O controller 2220 via an I / O chip 2240.
 CPU2212は、ROM2230およびRAM2214内に格納されたプログラムに従い動作し、それにより各ユニットを制御する。グラフィックコントローラ2216は、RAM2214内に提供されるフレームバッファ等またはそれ自体の中にCPU2212によって生成されたイメージデータを取得し、イメージデータがディスプレイデバイス2218上に表示されるようにする。 The CPU 2212 operates according to the programs stored in the ROM 2230 and the RAM 2214, thereby controlling each unit. The graphic controller 2216 acquires the image data generated by the CPU 2212 in a frame buffer or the like provided in the RAM 2214 or itself so that the image data is displayed on the display device 2218.
 通信インターフェイス2222は、ネットワークを介して他の電子デバイスと通信する。ハードディスクドライブ2224は、コンピュータ2200内のCPU2212によって使用されるプログラムおよびデータを格納する。DVD-ROMドライブ2226は、プログラムまたはデータをDVD-ROM2201から読み取り、ハードディスクドライブ2224にRAM2214を介してプログラムまたはデータを提供する。ICカードドライブは、プログラムおよびデータをICカードから読み取り、プログラムおよびデータをICカードに書き込む。 The communication interface 2222 communicates with other electronic devices via the network. The hard disk drive 2224 stores programs and data used by the CPU 2212 in the computer 2200. The DVD-ROM drive 2226 reads the program or data from the DVD-ROM 2201 and provides the program or data to the hard disk drive 2224 via the RAM 2214. The IC card drive reads the program and data from the IC card and writes the program and data to the IC card.
 ROM2230はその中に、アクティブ化時にコンピュータ2200によって実行されるブートプログラム等、およびコンピュータ2200のハードウェアに依存するプログラムのいずれかを格納する。入出力チップ2240はまた、様々な入出力ユニットをパラレルポート、シリアルポート、キーボードポート、マウスポート等を介して、入出力コントローラ2220に接続してよい。 The ROM 2230 stores either a boot program or the like executed by the computer 2200 at the time of activation, or a program that depends on the hardware of the computer 2200. The input / output chip 2240 may also connect various input / output units to the input / output controller 2220 via a parallel port, a serial port, a keyboard port, a mouse port, and the like.
 プログラムが、DVD-ROM2201またはICカードのようなコンピュータ可読媒体によって提供される。プログラムは、コンピュータ可読媒体から読み取られ、コンピュータ可読媒体の例でもあるハードディスクドライブ2224、RAM2214、またはROM2230にインストールされ、CPU2212によって実行される。これらのプログラム内に記述される情報処理は、コンピュータ2200に読み取られ、プログラムと、上記様々なタイプのハードウェアリソースとの間の連携をもたらす。装置または方法が、コンピュータ2200の使用に従い情報の操作または処理を実現することによって構成されてよい。 The program is provided by a computer-readable medium such as a DVD-ROM 2201 or an IC card. The program is read from a computer-readable medium, installed on a hard disk drive 2224, RAM 2214, or ROM 2230, which is also an example of a computer-readable medium, and executed by the CPU 2212. The information processing described in these programs is read by the computer 2200 and provides a link between the program and the various types of hardware resources described above. The device or method may be configured to perform manipulation or processing of information in accordance with the use of computer 2200.
For example, when communication is performed between the computer 2200 and an external device, the CPU 2212 may execute a communication program loaded in the RAM 2214 and, based on the processing described in the communication program, instruct the communication interface 2222 to perform communication processing. Under the control of the CPU 2212, the communication interface 2222 reads transmission data stored in a transmission buffer area provided in a recording medium such as the RAM 2214, the hard disk drive 2224, the DVD-ROM 2201, or an IC card, and transmits the read data to the network, or writes reception data received from the network to a reception buffer area or the like provided on the recording medium.
 また、CPU2212は、ハードディスクドライブ2224、DVD-ROMドライブ2226(DVD-ROM2201)、ICカード等のような外部記録媒体に格納されたファイルまたはデータベースの全部または必要な部分がRAM2214に読み取られるようにし、RAM2214上のデータに対し様々なタイプの処理を実行してよい。CPU2212は次に、処理されたデータを外部記録媒体にライトバックする。 Further, the CPU 2212 causes the RAM 2214 to read all or necessary parts of a file or database stored in an external recording medium such as a hard disk drive 2224, a DVD-ROM drive 2226 (DVD-ROM2201), or an IC card. Various types of processing may be performed on the data on the RAM 2214. The CPU 2212 then writes back the processed data to an external recording medium.
 様々なタイプのプログラム、データ、テーブル、およびデータベースのような様々なタイプの情報が記録媒体に格納され、情報処理を受けてよい。CPU2212は、RAM2214から読み取られたデータに対し、本開示の随所に記載され、プログラムの命令シーケンスによって指定される様々なタイプの操作、情報処理、条件判断、条件分岐、無条件分岐、情報の検索および置換等のいずれかを含む、様々なタイプの処理を実行してよく、結果をRAM2214に対しライトバックする。また、CPU2212は、記録媒体内のファイル、データベース等における情報を検索してよい。例えば、各々が第2の属性の属性値に関連付けられた第1の属性の属性値を有する複数のエントリが記録媒体内に格納される場合、CPU2212は、第1の属性の属性値が指定される、条件に一致するエントリを当該複数のエントリの中から検索し、当該エントリ内に格納された第2の属性の属性値を読み取り、それにより予め定められた条件を満たす第1の属性に関連付けられた第2の属性の属性値を取得してよい。 Various types of information such as various types of programs, data, tables, and databases may be stored in recording media and processed. The CPU 2212 describes various types of operations, information processing, conditional judgment, conditional branching, unconditional branching, and information retrieval described in various parts of the present disclosure with respect to the data read from the RAM 2214, and is specified by the instruction sequence of the program. Various types of processing may be performed, including any of the and replacements, and the results are written back to the RAM 2214. Further, the CPU 2212 may search for information in a file, a database, or the like in the recording medium. For example, when a plurality of entries each having an attribute value of the first attribute associated with the attribute value of the second attribute are stored in the recording medium, the CPU 2212 specifies the attribute value of the first attribute. Search for an entry that matches the condition from the plurality of entries, read the attribute value of the second attribute stored in the entry, and associate it with the first attribute that satisfies the predetermined condition. The attribute value of the second attribute obtained may be acquired.
The programs or software modules described above may be stored on a computer-readable medium on or near the computer 2200. A recording medium such as a hard disk or RAM provided in a server system connected to a dedicated communication network or the Internet can also be used as the computer-readable medium, thereby providing the program to the computer 2200 via the network.
 以上、本発明を実施の形態を用いて説明したが、本発明の技術的範囲は上記実施の形態に記載の範囲には限定されない。上記実施の形態に、多様な変更または改良を加えることが可能であることが当業者に明らかである。その様な変更または改良を加えた形態も本発明の技術的範囲に含まれ得ることが、請求の範囲の記載から明らかである。 Although the present invention has been described above using the embodiments, the technical scope of the present invention is not limited to the scope described in the above embodiments. It will be apparent to those skilled in the art that various changes or improvements can be made to the above embodiments. It is clear from the claims that the form with such modifications or improvements may also be included in the technical scope of the invention.
It should be noted that the operations, procedures, steps, stages, and other processes of the devices, systems, programs, and methods shown in the claims, the specification, and the drawings may be executed in any order unless the order is expressly indicated by terms such as "before" or "prior to", or unless the output of an earlier process is used in a later process. Even where the operation flows in the claims, the specification, and the drawings are described using terms such as "first" and "next" for convenience, this does not mean that they must be carried out in that order.
300 Arithmetic unit
310 Vector storage unit
320 Matrix storage unit
330 Pipeline calculation unit
340 Result storage unit
350 Arithmetic control unit
360 Main memory
370 Memory control unit
2200 Computer
2201 DVD-ROM
2210 Host controller
2212 CPU
2214 RAM
2216 Graphic controller
2218 Display device
2220 Input/output controller
2222 Communication interface
2224 Hard disk drive
2226 DVD-ROM drive
2230 ROM
2240 Input/output chip
2242 Keyboard

Claims (9)

  1.  An arithmetic unit comprising:
     a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector;
     a matrix storage unit that stores at least a first submatrix that is to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing, in a row direction and a column direction, a first matrix by which the first vector is to be multiplied;
     a pipeline calculation unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to a matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit; and
     an arithmetic control unit that, while the pipeline calculation unit is performing the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, instructs the pipeline calculation unit to execute another matrix-vector product that uses the first partial vector or the first submatrix.
  2.  The arithmetic unit according to claim 1, wherein
     the vector storage unit further stores a second partial vector among the first plurality of partial vectors,
     the matrix storage unit further stores a second partial matrix by which the second partial vector is to be multiplied, among the first plurality of partial matrices, and
     the arithmetic control unit instructs the pipeline calculation unit to execute, in or after a cycle in which the result of the matrix-vector product of the first partial matrix and the first partial vector becomes available without delay, an operation of adding the matrix-vector product of the second partial matrix and the second partial vector to the result of the matrix-vector product of the first partial matrix and the first partial vector.
  3.  The arithmetic unit according to claim 1 or 2, wherein
     the vector storage unit further stores a third partial vector to be multiplied by the first partial matrix, among a second plurality of partial vectors obtained by dividing a second vector to be multiplied by the first matrix, and
     the arithmetic control unit instructs the pipeline calculation unit to execute, as the operation of the other matrix-vector product during the pipeline operation of the matrix-vector product of the first partial matrix and the first partial vector, an operation of a matrix-vector product of the first partial matrix and the third partial vector.
  4.  The arithmetic unit according to claim 3, wherein the first vector and the second vector are column vectors included in a second matrix by which the first matrix is to be multiplied.
  5.  The arithmetic unit according to claim 4, wherein
     the vector storage unit stores a plurality of the second vectors included in the second matrix, and
     the arithmetic control unit fills each cycle, from after the start of the pipeline operation of the matrix-vector product of the first partial matrix and the first partial vector until before its result becomes available without delay, with an operation of a matrix-vector product of the first partial matrix and the third partial vector from each of the plurality of second vectors.
  6.  The arithmetic unit according to claim 1 or 2, wherein
     the matrix storage unit further stores a third partial matrix by which the first partial vector is to be multiplied, among the first plurality of partial matrices, and
     the arithmetic control unit instructs the pipeline calculation unit to execute, as the operation of the other matrix-vector product during the pipeline operation of the matrix-vector product of the first partial matrix and the first partial vector, an operation of a matrix-vector product of the third partial matrix and the first partial vector.
  7.  The arithmetic unit according to claim 6, wherein
     the matrix storage unit stores a plurality of the third partial matrices, and
     the arithmetic control unit fills each cycle, from after the start of the pipeline operation of the matrix-vector product of the first partial matrix and the first partial vector until before its result becomes available without delay, with an operation of a matrix-vector product of each of the plurality of third partial matrices and the first partial vector.
  8.  An arithmetic method comprising:
     storing, by a vector storage unit, at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector;
     storing, by a matrix storage unit, at least a first partial matrix by which the first partial vector is to be multiplied, among a first plurality of partial matrices obtained by dividing, in a row direction and a column direction, a first matrix by which the first vector is multiplied; and
     starting, by a pipeline calculation unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to a matrix-vector product of a partial matrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, during the pipeline operation of the matrix-vector product of the first partial matrix and the first partial vector, execution of an operation of another matrix-vector product using the first partial vector or the first partial matrix.
  9.  An arithmetic program executed by an arithmetic unit, wherein
     the arithmetic unit comprises:
     a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector;
     a matrix storage unit that stores at least a first partial matrix by which the first partial vector is to be multiplied, among a first plurality of partial matrices obtained by dividing, in a row direction and a column direction, a first matrix by which the first vector is multiplied; and
     a pipeline calculation unit capable of executing, by pipeline operation, an operation of adding an intermediate vector to a matrix-vector product of a partial matrix stored in the matrix storage unit and a partial vector stored in the vector storage unit, and
     the arithmetic program causes the arithmetic unit to start, during the pipeline operation of the matrix-vector product of the first partial matrix and the first partial vector, execution of an operation of another matrix-vector product using the first partial vector or the first partial matrix.
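
The scheduling idea recited in the claims can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the claimed hardware: the function names `block_matvec_add` and `blocked_products`, the block size `bs`, and the `latency` parameter are all hypothetical, and the timing model (one operation issued per cycle, result usable `latency` cycles after issue) is an assumption made for illustration. Each operation adds the product of one partial matrix and one partial vector to an intermediate vector. An operation that accumulates into the same intermediate vector as an earlier one must wait for that result, while products that reuse the same partial matrix with other vectors, or use other partial matrices with the same partial vector, are independent and can fill the waiting cycles.

```python
def block_matvec_add(block, sub_vec, acc):
    """Return acc + block * sub_vec for one partial matrix and one partial vector."""
    return [a + sum(m * x for m, x in zip(row, sub_vec))
            for row, a in zip(block, acc)]


def blocked_products(matrix, vectors, bs, latency):
    """Multiply `matrix` by every vector in `vectors` using bs x bs partial matrices.

    Assumed timing model (illustrative only): one operation is issued per cycle
    and its result becomes usable `latency` cycles after issue. An operation
    that accumulates into an intermediate vector must wait until the previous
    result for that intermediate vector is usable.

    Returns (results, stall_cycles).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    results = [[0.0] * n_rows for _ in vectors]
    ready = {}    # (vector index, row offset) -> cycle at which the result is usable
    cycle = 0     # next free issue cycle
    issued = 0    # number of operations issued so far
    for j in range(0, n_cols, bs):        # column offset: selects the partial vector
        for i in range(0, n_rows, bs):    # row offset: selects the partial matrix
            block = [row[j:j + bs] for row in matrix[i:i + bs]]
            for k, vec in enumerate(vectors):   # reuse the same partial matrix
                # Stall only if the intermediate vector is not yet usable.
                cycle = max(cycle, ready.get((k, i), 0))
                acc = results[k][i:i + bs]
                results[k][i:i + bs] = block_matvec_add(block, vec[j:j + bs], acc)
                ready[(k, i)] = cycle + latency
                cycle += 1
                issued += 1
    return results, cycle - issued


A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 0, -1, 2]
y = [0, 1, 0, 0]

one_vec, stalls_one = blocked_products(A, [x], bs=2, latency=4)
two_vec, stalls_two = blocked_products(A, [x, y], bs=2, latency=4)
print(one_vec[0], stalls_one)
print(two_vec[1], stalls_two)
```

In this model, a single vector with two row blocks leaves two stall cycles at a latency of four, whereas supplying a second vector provides four independent products per partial vector and removes the stalls. The numbers depend entirely on the assumed latency and block size.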
PCT/JP2020/022377 2019-06-07 2020-06-05 Calculation device, calculation method, and calculation program WO2020246598A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019-107283 2019-06-07
JP2019107283A JP7241397B2 (en) 2019-06-07 2019-06-07 Arithmetic Device, Arithmetic Method, and Arithmetic Program

Publications (1)

Publication Number Publication Date
WO2020246598A1 true WO2020246598A1 (en) 2020-12-10

Family

ID=73652229

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/022377 WO2020246598A1 (en) 2019-06-07 2020-06-05 Calculation device, calculation method, and calculation program

Country Status (2)

Country Link
JP (1) JP7241397B2 (en)
WO (1) WO2020246598A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02266458A (en) * 1989-04-06 1990-10-31 Nec Corp Neural network simulation device
JPH0644196A (en) * 1992-07-24 1994-02-18 Toshiba Corp Microprocessor for parallel computer

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6044196B2 (en) 2012-09-04 2016-12-14 リコーイメージング株式会社 Shooting lens controller

Also Published As

Publication number Publication date
JP7241397B2 (en) 2023-03-17
JP2020201659A (en) 2020-12-17

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20819387; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20819387; Country of ref document: EP; Kind code of ref document: A1)