WO2020246598A1 - Calculation device, method, and program - Google Patents


Publication number
WO2020246598A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
matrix
partial
storage unit
calculation
Prior art date
Application number
PCT/JP2020/022377
Other languages
English (en)
Japanese (ja)
Inventor
淳一郎 牧野
戎崎 俊一
Original Assignee
国立研究開発法人理化学研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立研究開発法人理化学研究所
Publication of WO2020246598A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • the present invention relates to an arithmetic unit, an arithmetic method, and an arithmetic program.
  • Matrix-matrix products (hereinafter referred to as "matrix products") and matrix-vector products account for most of the calculation amount. Arithmetic units and arithmetic methods for executing such matrix operations efficiently have therefore been developed (see Patent Documents 1 to 3), as have processors capable of performing matrix operations.
  • Patent Document 1 International Publication No. 2018/207926
  • Patent Document 2 Japanese Unexamined Patent Publication No. 2018-139045
  • Patent Document 3 Japanese Unexamined Patent Publication No. 2018-197906
  • The matrix-vector product of an n-dimensional square matrix and an n-dimensional vector includes n^2 multiplications and about n^2 additions, for a calculation amount of about 2n^2. When the n-dimensional square matrix is fixed, the calculation amount of the matrix-vector product is therefore on the order of n^2 for each input n-dimensional vector. Consequently, increasing the matrix size and enlarging the matrix arithmetic unit reduces the ratio of the data load amount to the calculation amount. However, when the matrix arithmetic unit is made large, the load/store capability of the register file and similar components becomes relatively low, and the processing performance for operations on small matrices and for non-matrix operations becomes relatively low.
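The load-to-calculation ratio described above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the matrix stays resident in the arithmetic unit so that only the n vector elements are loaded per product; the function name `matvec_cost` is hypothetical and the addition count n(n-1) is one common way of counting "about n^2" additions.

```python
def matvec_cost(n):
    """Operation and load counts for one product of a fixed n-by-n matrix and a vector."""
    multiplications = n * n
    additions = n * (n - 1)      # about n^2 (assumption: n-1 additions per output row)
    loaded_elements = n          # assumption: only the input vector is loaded
    return multiplications + additions, loaded_elements


for n in (4, 8, 16):
    ops, loads = matvec_cost(n)
    # The ratio loads / ops shrinks roughly as 1 / (2n) as n grows.
    print(n, ops, loads, loads / ops)
```

Under these assumptions, doubling n roughly halves the number of loaded elements per arithmetic operation, which is the effect the paragraph describes.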
  • an arithmetic unit may include a vector storage unit that stores at least the first partial vector among the first plurality of partial vectors obtained by dividing the first vector.
  • The arithmetic unit may include a matrix storage unit that stores at least a first submatrix, which is to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing a first matrix, which is to be multiplied by the first vector, in the row direction and the column direction.
  • The arithmetic unit may include a pipeline calculation unit capable of executing, as a pipeline operation, an operation that adds an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit.
  • The arithmetic unit may include an arithmetic control unit that, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, instructs the pipeline calculation unit to execute another matrix-vector product operation that uses the first partial vector or the first submatrix.
  • the vector storage unit may further store the second partial vector among the first plurality of partial vectors.
  • The matrix storage unit may further store a second submatrix, which is to be multiplied by the second partial vector, among the first plurality of submatrices.
  • The arithmetic control unit may instruct the pipeline calculation unit to execute, after the cycle in which the result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, an operation that adds the matrix-vector product of the second submatrix and the second partial vector to the result of the matrix-vector product of the first submatrix and the first partial vector.
  • the vector storage unit may further store the third partial vector to be multiplied by the first submatrix among the second plurality of partial vectors obtained by dividing the second vector to be multiplied by the first matrix.
  • The arithmetic control unit may instruct the pipeline calculation unit to execute the matrix-vector product of the first submatrix and the third partial vector as the other matrix-vector product operation.
  • the first vector and the second vector may be column vectors included in the second matrix to be multiplied by the first matrix.
  • the vector storage unit may store a plurality of second vectors included in the second matrix.
  • The arithmetic control unit may fill each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until just before its result becomes available without delay, with a matrix-vector product of the first submatrix and the third partial vector from each of the plurality of second vectors.
  • The matrix storage unit may further store a third submatrix, which is to be multiplied by the first partial vector, among the first plurality of submatrices.
  • The arithmetic control unit may instruct the pipeline calculation unit to execute the matrix-vector product of the third submatrix and the first partial vector as the other matrix-vector product operation.
  • The matrix storage unit may store a plurality of third submatrices.
  • The arithmetic control unit may fill each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector until just before its result becomes available without delay, with a matrix-vector product of each of the plurality of third submatrices and the first partial vector.
  • A calculation method may include storing, in a vector storage unit, at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector.
  • The calculation method may include storing, in a matrix storage unit, at least a first submatrix, which is to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing a first matrix, which is to be multiplied by the first vector, in the row direction and the column direction.
  • The calculation method may include starting, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, execution of another matrix-vector product operation that uses the first partial vector or the first submatrix, in a pipeline calculation unit that can execute, as a pipeline operation, an operation that adds an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit.
  • An arithmetic program is executed by an arithmetic unit. The arithmetic unit may include a vector storage unit that stores at least a first partial vector among a first plurality of partial vectors obtained by dividing a first vector.
  • The arithmetic unit may include a matrix storage unit that stores at least a first submatrix, which is to be multiplied by the first partial vector, among a first plurality of submatrices obtained by dividing a first matrix, which is to be multiplied by the first vector, in the row direction and the column direction.
  • The arithmetic unit may include a pipeline calculation unit capable of executing, as a pipeline operation, an operation that adds an intermediate vector to the matrix-vector product of a submatrix stored in the matrix storage unit and a partial vector stored in the vector storage unit.
  • The arithmetic program may cause the arithmetic unit to start executing, during the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector, another matrix-vector product operation that uses the first partial vector or the first submatrix.
  • FIG. 8 shows an example of a computer 2200 in which a plurality of aspects of the present invention may be embodied in whole or in part.
  • FIG. 1 shows an example of a matrix operation according to this embodiment.
  • the matrices A, B, and C are square matrices of 8 rows and 8 columns.
  • In the range j ≥ 3, the components bij and cij are not illustrated.
  • The column vector of column j of the matrix C is the vector vcj = (c1j, c2j, ..., c8j)^T.
  • The matrix A is shown as the submatrices A11, A21, A12, and A22, obtained by dividing the matrix A into two in each of the row direction and the column direction.
  • the partial vectors obtained by dividing the vector vcj into two in the row direction are referred to as vc1j and vc2j.
  • the matrix product is shown as an example of the matrix operation, and the matrix vector product is described as being included in a part of the matrix product.
  • This embodiment illustrates the case where the matrices A, B, and C have a power-of-2 number of elements in the row and column directions, and the matrix A is divided into a power-of-2 number of parts in the row and column directions.
  • The matrices A, B, and C may have a number of elements other than a power of 2 in at least one of the row and column directions, and the matrix A may be divided into a number of parts other than a power of 2 in at least one of the row and column directions (e.g., 3x3, 5x5, 9x9, 3x5, 5x9).
  • FIG. 2 shows an example of a calculation formula obtained by decomposing the matrix operation according to the present embodiment into a matrix-vector product of a submatrix and a partial vector.
  • The matrix-vector product of the matrix A and the vector vb includes d × d matrix-vector products of a submatrix and a partial vector.
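The decomposition into d × d partial products can be illustrated in plain Python. This is a minimal sketch, assuming d is the number of parts per direction and that the matrix size is divisible by d; the helper names (`split_matrix`, `split_vector`, `blocked_matvec`) and the test data are hypothetical and not taken from the patent.

```python
def matvec(M, v):
    """Plain matrix-vector product of a list-of-rows matrix and a list vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def split_matrix(A, d):
    """Split a square matrix into d x d submatrices (assumes len(A) is divisible by d)."""
    s = len(A) // d
    return [[[row[k * s:(k + 1) * s] for row in A[i * s:(i + 1) * s]]
             for k in range(d)] for i in range(d)]


def split_vector(v, d):
    """Split a vector into d partial vectors of equal length."""
    s = len(v) // d
    return [v[k * s:(k + 1) * s] for k in range(d)]


def blocked_matvec(A, v, d):
    """Compute A @ v as d * d products of a submatrix and a partial vector."""
    blocks, parts = split_matrix(A, d), split_vector(v, d)
    s = len(v) // d
    result = []
    for i in range(d):                      # row range of the result
        acc = [0] * s                       # intermediate vector
        for k in range(d):                  # column range of A / row range of v
            prod = matvec(blocks[i][k], parts[k])
            acc = [a + p for a, p in zip(acc, prod)]
        result.extend(acc)
    return result


# Hypothetical 8x8 matrix and length-8 vector, split 2 x 2 as in FIG. 1.
A = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(8)]
vb = [c + 1 for c in range(8)]
print(blocked_matvec(A, vb, 2) == matvec(A, vb))
```

With d = 2 the inner loops perform four submatrix-by-partial-vector products, each of a size that a 4 × 4 unit operation could handle.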
  • FIG. 3 shows the configuration of the arithmetic unit 300 according to the present embodiment.
  • the arithmetic unit 300 can execute the matrix vector product of the matrix up to the number of rows and columns specified in the specifications and the vector up to the number of rows specified in the specifications as a unit operation by pipeline operation.
  • The arithmetic unit 300 calculates a matrix-vector product of a matrix and a vector larger than the size that can be processed by one unit operation by dividing it into a plurality of matrix-vector products of a submatrix and a partial vector, each of which can be processed by one unit operation.
  • the arithmetic unit 300 includes a vector storage unit 310, a matrix storage unit 320, a pipeline calculation unit 330, a result storage unit 340, an arithmetic control unit 350, a main memory 360, and a memory control unit 370.
  • the vector storage unit 310 stores at least the first partial vector among the first plurality of partial vectors obtained by dividing the first vector.
  • the vector storage unit 310 is a register as an example.
  • the vector storage unit 310 may be another storage device such as a cache memory that can supply the partial vector to the pipeline calculation unit 330 in a pipeline manner.
  • The first vector is the target vector to be multiplied by a matrix of which at least one submatrix is stored in the matrix storage unit 320.
  • the first vector is larger than the size that the arithmetic unit 300 can process in one unit of arithmetic.
  • the first plurality of partial vectors are obtained by dividing the first vector into a size that can be processed by one unit of operation.
  • the first vector corresponds to any of the vectors vbj (for example, vb1).
  • the vector storage unit 310 may have a sufficient storage area for further storing the second partial vector and the other partial vectors among the first plurality of partial vectors. For example, in the matrix operation of FIG. 1, the vector storage unit 310 may store the partial vector vb1j and the partial vector vb2j.
  • The matrix storage unit 320 stores at least the first submatrix, which is to be multiplied by the first partial vector, among the first plurality of submatrices obtained by dividing the first matrix, which is to be multiplied by the first vector, in the row direction and the column direction.
  • the matrix storage unit 320 is a register as an example.
  • the matrix storage unit 320 may be another storage device such as a cache memory that can supply a submatrix to the pipeline calculation unit 330 in a pipeline manner.
  • The first matrix is the target matrix to be multiplied by the first vector, of which at least one partial vector is stored in the vector storage unit 310.
  • the first matrix is larger than the size that the arithmetic unit 300 can process in one unit of arithmetic.
  • the first plurality of sub-matrixes are obtained by dividing the first matrix into a size that can be processed by the arithmetic unit 300 by one unit of arithmetic operations.
  • the first matrix corresponds to the matrix A.
  • The matrix storage unit 320 may have a storage area sufficient to further store the second submatrix, which is to be multiplied by the second partial vector, and other submatrices among the first plurality of submatrices.
  • The matrix storage unit 320 may store the submatrix A11 to be multiplied by the partial vector vb1j and the submatrix A12 to be multiplied by the partial vector vb2j.
  • the first partial vector and the second partial vector are located in different row ranges in the first vector. Therefore, the first submatrix and the second submatrix are located in different column ranges in the target matrix.
  • the first sub-matrix and the second sub-matrix may be located in the same row range in the target matrix.
  • The pipeline calculation unit 330 is connected to the vector storage unit 310 and the matrix storage unit 320, receives the partial vector to be calculated from the vector storage unit 310, and receives the submatrix to be calculated from the matrix storage unit 320.
  • The pipeline calculation unit 330 can execute, as a pipeline operation, an operation that adds an intermediate vector to the matrix-vector product of the submatrix and the partial vector to be calculated.
  • The pipeline calculation unit 330 can execute, as a unit operation, an operation that calculates the matrix-vector product of a 4-row, 4-column submatrix and a 4-row partial vector and adds a 4-row intermediate vector to that product to obtain the partial vector of the calculation result (also referred to as the "result vector").
  • Being executable as a unit operation means that, in response to a request such as an external instruction or the execution of an instruction, the pipeline calculation unit 330 executes the matrix-vector product of the submatrix and the partial vector to be calculated together with the addition of the intermediate vector, and outputs the result.
  • The pipeline calculation unit 330 may have a large number of calculation units so that all the basic operations (for example, multiplications and additions of values) included in this unit operation can be performed by separate calculation units. Alternatively, basic operations of the unit operation may be performed by the same calculation unit.
  • That the pipeline calculation unit 330 performs a pipeline operation means that, when it outputs a result after processing in a plurality of stages following the start of an operation, those stages can operate in parallel. That is, the pipeline calculation unit 330 can start other operations one after another in each cycle between the start of a certain operation and the output of its result, provided nothing obstructs their execution.
  • As an example, the pipeline calculation unit 330 may input a submatrix and a partial vector in the first cycle, multiply the corresponding elements of the submatrix and the partial vector in the second cycle, sum the products calculated in the second cycle for each element of the result vector in the third cycle, and output the partial vector of the calculation result in the fourth cycle.
  • the pipeline calculation unit 330 can have a pipeline structure having an arbitrary number of stages, if necessary.
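The unit operation and the four-cycle example above can be modelled in software. This is a minimal sketch, not the patented hardware: the stage names, the function names `unit_operation` and `stage_occupancy`, and the assumption that one independent operation is issued per cycle are all illustrative.

```python
def unit_operation(sub, part, intermediate):
    """result = sub @ part + intermediate, the unit operation described above."""
    # "multiply" stage: element-wise products of each row with the partial vector
    products = [[m * x for m, x in zip(row, part)] for row in sub]
    # "sum" stage: add up the products of each row, plus the intermediate vector
    return [sum(p) + t for p, t in zip(products, intermediate)]


# Assumed four stages, matching the four-cycle example in the text.
STAGES = ("input", "multiply", "sum", "output")


def stage_occupancy(num_ops):
    """Map cycle -> {stage: operation index} when one independent op is issued per cycle."""
    table = {}
    for op in range(num_ops):
        for offset, stage in enumerate(STAGES):
            table.setdefault(op + offset, {})[stage] = op
    return table


print(unit_operation([[1, 2], [3, 4]], [1, 1], [10, 20]))
for cycle, stages in sorted(stage_occupancy(3).items()):
    print(cycle, stages)
```

In this model, three independent operations finish in six cycles rather than twelve, because each stage works on a different operation in the same cycle.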
  • the result storage unit 340 is connected to the pipeline calculation unit 330.
  • the result storage unit 340 receives and stores the result vector output by the pipeline calculation unit 330.
  • the result vector is, for example, vc11 and vc21 in FIG.
  • the result storage unit 340 is a register as an example.
  • the result storage unit 340 may be another storage device such as a cache memory that can store the partial vector from the pipeline calculation unit 330 in a pipeline manner.
  • the vector storage unit 310, the matrix storage unit 320, and the result storage unit 340 may be implemented as the same storage device.
  • The arithmetic control unit 350 is connected to the vector storage unit 310, the matrix storage unit 320, the pipeline calculation unit 330, and the result storage unit 340.
  • In response to a matrix operation execution request, such as receiving an instruction from outside the arithmetic unit 300 or decoding a matrix operation instruction during program execution in the arithmetic unit 300, the arithmetic control unit 350 controls the vector storage unit 310, the matrix storage unit 320, the pipeline calculation unit 330, and the result storage unit 340 so as to execute the requested matrix operation.
  • the main memory 360 stores the matrix to be calculated and the calculation result.
  • the memory control unit 370 is connected between the vector storage unit 310, the matrix storage unit 320, the result storage unit 340, and the main memory 360.
  • The memory control unit 370 transfers data between the main memory 360 and each of the vector storage unit 310, the matrix storage unit 320, and the result storage unit 340, in response to an external instruction or a memory access instruction during program execution in the arithmetic unit 300.
  • In response to a request for a vector load from the main memory 360 to the vector storage unit 310, the memory control unit 370 reads the partial vector specified by the vector load, among the partial vectors stored in the main memory 360, from the main memory 360 and stores it in the vector storage unit 310. Likewise, in response to a request for a matrix load from the main memory 360 to the matrix storage unit 320, the memory control unit 370 reads the submatrix specified by the matrix load, among the submatrices stored in the main memory 360, from the main memory 360 and stores it in the matrix storage unit 320.
  • In response to a request for a matrix or vector store from the result storage unit 340 to the main memory 360, the memory control unit 370 reads the matrix or vector of the calculation result stored in the result storage unit 340 and stores it in the main memory 360.
  • Instead of providing the main memory 360 in addition to the vector storage unit 310 and the matrix storage unit 320, a relatively large memory that functions as the vector storage unit 310 and the matrix storage unit 320 may be provided.
  • Partial vectors and submatrices may be supplied to the pipeline calculation unit 330 directly from memory in a pipelined manner.
  • The waiting time of the second operation can be reduced to some extent by supplying the result of the first operation to the second operation (bypassing or forwarding) without waiting for the result to be written to a register.
  • Between the first operation and the second operation, however, it is difficult to completely eliminate the empty slots that a pipeline hazard creates in the pipeline of the pipeline calculation unit 330.
  • While the pipeline calculation unit 330 is performing the pipeline operation of the matrix-vector product of the first submatrix and the first partial vector (for example, A11 × vb11), the arithmetic control unit 350 instructs the pipeline calculation unit 330 to execute another matrix-vector product operation that uses the first partial vector or the first submatrix.
  • The "other matrix-vector product" operation is an operation that includes a matrix-vector product and does not use the result of the matrix-vector product of the first submatrix and the first partial vector. For example, it may be an operation in which a result other than that of the matrix-vector product of the first submatrix and the first partial vector is added to a matrix-vector product.
  • While the pipeline calculation unit 330 is waiting for the result of the first operation and before the second operation is started, the arithmetic control unit 350 inputs to the pipeline calculation unit 330 one or more other matrix-vector product operations that do not depend on the result of the first operation, thereby improving the utilization efficiency of the pipeline calculation unit 330.
  • The arithmetic control unit 350 may instruct the pipeline calculation unit 330 to execute, after the cycle in which the result of the matrix-vector product of the first submatrix and the first partial vector becomes available without delay, an operation that adds the matrix-vector product of the second submatrix and the second partial vector to the result of the matrix-vector product of the first submatrix and the first partial vector.
  • As a result, the arithmetic control unit 350 can prevent a pipeline hazard from occurring in the second operation and can insert other matrix-vector product operations between the first operation and the second operation.
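The effect of inserting an independent operation between a first operation and a dependent second operation can be shown with a toy in-order issue model. This is a sketch under stated assumptions: the function `issue_cycles`, the operation labels, and the latency of two cycles (one free cycle between dependent operations) are hypothetical choices for illustration.

```python
def issue_cycles(ops, latency):
    """In-order issue model.

    ops is a list of (name, dependency or None). A dependent operation cannot
    issue until `latency` cycles after the operation it depends on was issued.
    Returns a dict mapping each operation name to its issue cycle.
    """
    issued = {}
    cycle = 0
    for name, dep in ops:
        if dep is not None:
            cycle = max(cycle, issued[dep] + latency)   # stall if the result is not ready
        issued[name] = cycle
        cycle += 1
    return issued


LATENCY = 2  # assumption: result usable two cycles after issue

# "second" adds to the result of "first"; "other" is independent of both.
back_to_back = [("first", None), ("second", "first"), ("other", None)]
interleaved = [("first", None), ("other", None), ("second", "first")]

print(issue_cycles(back_to_back, LATENCY))   # one idle cycle before "second"
print(issue_cycles(interleaved, LATENCY))    # no idle cycle
```

In this model the back-to-back order leaves cycle 1 empty, while the interleaved order issues one operation in every cycle, which is the improvement in utilization the text describes.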
  • FIG. 4 shows a first example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
  • In cycle 0, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb11, an example of the first partial vector, instructs the matrix storage unit 320 to read A11, an example of the first submatrix, and instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the first submatrix A11 and the first partial vector vb11 and to store it, as an intermediate vector in the middle of the calculation, in the intermediate register (temporary register) vctmp1 of the pipeline calculation unit 330.
  • In addition to vb11, an example of the first partial vector, and vb21, an example of the second partial vector, the vector storage unit 310 further stores a third partial vector (vb12 as an example), which is to be multiplied by the first submatrix A11, among the second plurality of partial vectors obtained by dividing a second vector to be multiplied by the first matrix A.
  • the first vector and the second vector are column vectors included in the second matrix B to be multiplied by the first matrix A, for example, the first vector is vb1 and the second vector is vb2.
  • the third partial vector is vb12 to be multiplied by the first submatrix A11 of the second plurality of partial vectors vbi2 obtained by dividing the second vector vb2.
  • the first vector and the second vector may each be separate vectors to be multiplied by the matrix A.
  • In cycle 1, the arithmetic control unit 350 instructs the vector storage unit 310 to read the third partial vector vb12, instructs the matrix storage unit 320 to read the first submatrix A11, and instructs the pipeline calculation unit 330 to execute the matrix-vector product of the first submatrix and the third partial vector as another matrix-vector product operation that does not cause a pipeline hazard. In response, the pipeline calculation unit 330 executes an operation that stores the matrix-vector product of the first submatrix and the third partial vector, as an intermediate vector in the middle of the calculation, in the intermediate register vctmp2 of the pipeline calculation unit 330.
  • This operation corresponds to the operation of the first matrix vector product in the third row of FIG. 2, and the matrix vector products of cycles 0 and 1 are reflected in different result vectors vc11 and vc12. Therefore, since there is no dependency between these operations, the pipeline calculation unit 330 can execute these operations without causing a pipeline hazard.
  • In cycle 2, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb21, an example of the second partial vector, instructs the matrix storage unit 320 to read A12, an example of the second submatrix, instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the second submatrix A12 and the second partial vector vb21 and to add the result vctmp1 of the operation of cycle 0, and instructs the result storage unit 340 to store the resulting partial vector vc11.
  • The operation of cycle 2 depends on the operation of cycle 0, but by inserting the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2, the arithmetic control unit 350 can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.
  • The vector storage unit 310 may further store a fourth partial vector, which is to be multiplied by the second submatrix A12, among the second plurality of partial vectors.
  • In cycle 3, the arithmetic control unit 350 instructs the vector storage unit 310 to read vb22, an example of the fourth partial vector, instructs the matrix storage unit 320 to read A12, an example of the second submatrix, instructs the pipeline calculation unit 330 to calculate the matrix-vector product of the second submatrix A12 and the fourth partial vector vb22 and to add the result vctmp2 of the operation of cycle 1, and instructs the main memory 360 to store the obtained partial vector vc12.
  • the operation of cycle 3 depends on the operation of cycle 1, but the operation control unit 350 inserts the operation of cycle 2 which does not depend on the operation of cycle 1 between the operations of cycle 1 and cycle 3. It is possible to improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.
  • The partial vectors (vc11, vc12) of the first row range (first to fourth rows) in the plurality of column vectors (vc1, vc2) of the matrix C are calculated in cycles 0 to 3, and the partial vectors (vc21, vc22) of the second row range (fifth to eighth rows) in the plurality of column vectors (vc1, vc2) of the matrix C are calculated in cycles 4 to 7.
  • The operations of cycles 4 to 7 are the same, except that the submatrices A21 and A22 are used instead of the submatrices A11 and A12 and the partial vectors vc21 and vc22 are produced instead of the partial vectors vc11 and vc12.
  • Between the first operation, which is the matrix-vector product of the first submatrix and the first partial vector, and the second operation, which uses its result, the arithmetic control unit 350 inserts another matrix-vector product operation that uses the first submatrix, namely, in this example, the matrix-vector product of the first submatrix and the third partial vector. As a result, the arithmetic control unit 350 can make use of the one free cycle required between the first operation and the second operation.
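The cycle 0 to 3 schedule described for FIG. 4 can be replayed with small numbers to confirm that the interleaved order still yields the correct partial vectors. This is a minimal sketch: the 2 × 2 submatrices and length-2 partial vectors are hypothetical stand-ins for the 4 × 4 blocks of FIG. 1, the register dictionary stands in for vctmp1, vctmp2 and the result storage, and the two-cycle spacing check is an assumed pipeline latency.

```python
def matvec_add(sub, part, acc):
    """Unit operation: sub @ part + acc."""
    return [sum(m * x for m, x in zip(row, part)) + a for row, a in zip(sub, acc)]


# Hypothetical data (not from the patent).
A11, A12 = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
vb11, vb21 = [1, 0], [0, 1]      # partial vectors of column vb1
vb12, vb22 = [2, 1], [1, 3]      # partial vectors of column vb2
ZERO = [0, 0]

# (cycle, submatrix, partial vector, register read, register written)
schedule = [
    (0, A11, vb11, None, "vctmp1"),
    (1, A11, vb12, None, "vctmp2"),      # independent of cycle 0, reuses A11
    (2, A12, vb21, "vctmp1", "vc11"),    # depends on cycle 0
    (3, A12, vb22, "vctmp2", "vc12"),    # depends on cycle 1
]

regs, written_at = {}, {}
for cycle, sub, part, src, dst in schedule:
    if src is not None:
        # Assumed latency: a result is usable two cycles after its operation was issued.
        assert cycle - written_at[src] >= 2
    regs[dst] = matvec_add(sub, part, regs[src] if src else ZERO)
    written_at[dst] = cycle

print(regs["vc11"], regs["vc12"])
```

Each dependent operation reads a register written two cycles earlier, so under the assumed latency no cycle is left idle.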
  • The arithmetic control unit 350 may insert matrix-vector products of the first submatrix and a plurality of third partial vectors between the first operation and the second operation.
  • the vector storage unit 310 further stores a plurality of second vectors vb2, vb3, ... Included in the second matrix B.
  • The arithmetic control unit 350 fills each cycle, from the start of the pipeline operation of the matrix-vector product of the first submatrix A11 and the first partial vector vb11 until just before its result becomes available without delay, with a matrix-vector product of the first submatrix A11 and the third partial vector from each of the plurality of second vectors.
  • The first vector and the plurality of second vectors may be arranged in the column order of the second matrix or in the reverse of that order, or they may be column vectors of arbitrary columns that are not arranged in the column order of the second matrix.
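For a deeper pipeline, the same idea extends to interleaving more columns. The sketch below, with hypothetical function names and an assumed latency measured in issue cycles, checks whether visiting every column once per column block leaves enough distance between dependent operations.

```python
def interleaved_order(num_columns, d):
    """Issue order for one row range of A: for each column block k, visit every column j.

    Each entry (k, j) stands for the product of submatrix A_1k and partial vector vb_kj,
    accumulated into the result partial vector of column j.
    """
    return [(k, j) for k in range(d) for j in range(num_columns)]


def has_stall(order, latency):
    """True if some operation would issue before its accumulator is available.

    Assumes one operation is issued per cycle and that a result becomes usable
    `latency` cycles after its operation was issued.
    """
    last_issue = {}
    for cycle, (_, j) in enumerate(order):
        if j in last_issue and cycle - last_issue[j] < latency:
            return True
        last_issue[j] = cycle
    return False


print(has_stall(interleaved_order(2, 2), 2))   # two columns suffice for latency 2
print(has_stall(interleaved_order(2, 2), 4))   # but not for latency 4
print(has_stall(interleaved_order(4, 2), 4))   # four columns fill a latency of 4
```

In this model the number of interleaved columns must be at least the latency for the pipeline to stay full, which is why more second vectors are stored when the pipeline has more stages.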
  • FIG. 5 shows a second example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
  • The arithmetic control unit 350 may perform control so that the operations of cycles 4 to 5 in FIG. are performed before the operations of cycles 2 to 3.
  • While executing the pipeline operation of the first operation, which calculates the matrix-vector product of the first submatrix A11 and the first partial vector vb11, the arithmetic control unit 350 causes the pipeline calculation unit 330 to execute the operation of cycle 1 in FIG. 5, which is a matrix-vector product operation using the first submatrix, and the operation of cycle 2 in FIG.
  • The arithmetic control unit 350 may also execute, between the first operation and the second operation, the operation of cycle 3, which is the matrix-vector product of the second submatrix A21 used in the operation of cycle 2 and the second partial vector vb12 used in the operation of cycle 1. As a result, the arithmetic control unit 350 can further fill the empty cycles between the first operation and the second operation.
  • the execution order of the operations in cycles 0 to 3 may be arbitrary, and the execution order of the operations in cycles 4 to 7 may be determined according to the execution order of the corresponding operations in cycles 0 to 3.
  • FIG. 6 shows a third example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
  • the calculation control unit 350 performs the same control as in cycle 0 of FIG.
  • In addition to A11, an example of the first submatrix, and A12, an example of the second submatrix, the matrix storage unit 320 further stores a third submatrix (A21 as an example), which is to be multiplied by the first partial vector vb11, among the first plurality of submatrices obtained by dividing the first matrix A in the row direction and the column direction.
  • the third submatrix may be a submatrix included in a matrix other than the first matrix A.
  • In cycle 1, the arithmetic control unit 350 instructs the vector storage unit 310 to read the first partial vector vb11, instructs the matrix storage unit 320 to read the third submatrix A21, and instructs the pipeline calculation unit 330 to execute the matrix-vector product of the third submatrix A21 and the first partial vector vb11.
  • In response, the pipeline calculation unit 330 executes an operation that stores the matrix-vector product of the third submatrix A21 and the first partial vector vb11, as an intermediate vector in the middle of the calculation, in the intermediate register vctmp2 of the pipeline calculation unit 330. This operation corresponds to the first matrix-vector product in the second row of FIG. 2, and the matrix-vector products of cycles 0 and 1 are reflected in different result vectors vc11 and vc21. Since there is no dependency between these operations, the pipeline calculation unit 330 can execute them without causing a pipeline hazard.
  • the calculation control unit 350 performs the same control as in cycle 2 of FIG.
  • The operation of cycle 2 depends on the operation of cycle 0, but the operation control unit 350 inserts the operation of cycle 1, which does not depend on the operation of cycle 0, between the operations of cycle 0 and cycle 2. This can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.
  • The matrix storage unit 320 may further store, among the first plurality of submatrices Aij, a fourth submatrix to be multiplied by the second partial vector vb21.
  • The arithmetic control unit 350 instructs the vector storage unit 310 to read vb21, which is an example of the second partial vector, and instructs the matrix storage unit 320 to read A22, which is an example of the fourth submatrix. It also instructs that the obtained partial vector vc21 be stored in the main memory 360.
  • The operation of cycle 3 depends on the operation of cycle 1, but the operation control unit 350 inserts the operation of cycle 2, which does not depend on the operation of cycle 1, between the operations of cycle 1 and cycle 3. This can improve the utilization efficiency of the pipeline of the pipeline calculation unit 330.
  • Two partial vectors vc11 and vc21 included in one column vector vc1 of the matrix C are calculated in cycles 0 to 3, and two partial vectors vc12 and vc22 included in another column vector vc2 of the matrix C are calculated in cycles 4 to 7.
  • the operations of cycles 4 to 7 are the same except that the partial vectors vb12 and vb22 are used instead of the partial vectors vb11 and vb21, and the partial vectors vc12 and vc22 are used instead of the partial vectors vc11 and vc21, and thus the description thereof will be omitted.
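The eight-cycle order described for FIG. 6 can be checked numerically with a small sketch. The Python code below is an illustration only, not the implementation of the arithmetic unit 300. The helper names (`matvec`, `vec_add`, `run_fig6_schedule`, `reference`), the register name `vctmp1`, the 2x2 block size, and the sample values are all assumptions made for this example. It also assumes that cycle 2 adds the intermediate vector from cycle 0 to the product of A12 and vb21, which the text implies but does not spell out in this passage.

```python
def matvec(matrix, vector):
    """Return the matrix vector product of a list-of-rows matrix and a list vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def vec_add(u, v):
    """Element-wise sum of two equally sized vectors."""
    return [a + b for a, b in zip(u, v)]


def run_fig6_schedule(A, vb):
    """Follow the cycle order described for FIG. 6 (assumed details are marked)."""
    vc = {}
    for col in ("1", "2"):  # col "1": cycles 0-3, col "2": cycles 4-7
        upper, lower = vb["1" + col], vb["2" + col]
        vctmp1 = matvec(A["11"], upper)  # cycle 0 / 4 (register name is hypothetical)
        vctmp2 = matvec(A["21"], upper)  # cycle 1 / 5, independent of cycle 0 / 4
        vc["1" + col] = vec_add(matvec(A["12"], lower), vctmp1)  # cycle 2 / 6 (assumed)
        vc["2" + col] = vec_add(matvec(A["22"], lower), vctmp2)  # cycle 3 / 7
    return vc


def reference(A, vb, col):
    """Multiply the undivided 4x4 matrix by the undivided column vector."""
    full = [A["11"][r] + A["12"][r] for r in range(2)]
    full += [A["21"][r] + A["22"][r] for r in range(2)]
    return matvec(full, vb["1" + col] + vb["2" + col])


# Sample data: keys "11", "12", ... stand for A11, A12, ... and vb11, vb12, ...
A = {"11": [[1, 2], [3, 4]], "12": [[5, 6], [7, 8]],
     "21": [[9, 10], [11, 12]], "22": [[13, 14], [15, 16]]}
vb = {"11": [1, 0], "21": [0, 1], "12": [1, 1], "22": [2, -1]}
vc = run_fig6_schedule(A, vb)
```

In this sketch the two products that share vb11 are issued back to back, and each accumulation is issued two cycles after the product it depends on, which mirrors the dependency structure described above.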
  • As described above, the arithmetic control unit 350 inserts another matrix vector product operation that uses the first partial vector (in this example, the matrix vector product of the third submatrix and the first partial vector) between the first operation, which is the matrix vector product of the first submatrix and the first partial vector, and the second operation, which uses the result of the first operation. As a result, the calculation control unit 350 can utilize the one free cycle required between the first calculation and the second calculation.
  • The operation control unit 350 may insert the matrix vector products of each of a plurality of third submatrices and the first partial vector between the first operation and the second operation.
  • In this case, the matrix storage unit 320 stores a plurality of third submatrices A21, A31, ... that are included in the first matrix A and are to be multiplied by the first partial vector.
  • The arithmetic control unit 350 inserts these operations into the plurality of cycles from the start of the pipeline operation of the matrix vector product of the first submatrix A11 and the first partial vector vb11 until the calculation result becomes available without delay.
  • The first submatrix and the plurality of third submatrices may be arranged in the same row range of the first matrix in column order or in reverse column order. Alternatively, they need not be arranged in the column order of the second matrix, and each may be a submatrix of any column range.
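The effect of inserting independent matrix vector products between a first operation and the second operation that depends on it can be illustrated with a toy in-order pipeline model. The Python sketch below is a simplification introduced for this explanation, not the behavior of the pipeline calculation unit 330: the function names, the operation labels, and the `latency` parameter (the number of cycles after issue at which a result can be consumed) are assumptions. A latency of 2 corresponds to the single free cycle mentioned in the text.

```python
def issue_cycles(ops, latency):
    """Issue ops in order, one per cycle, waiting until each dependency is ready.

    ops is a list of (name, dependency_name_or_None) pairs.
    """
    issue = {}
    next_free = 0
    for name, dep in ops:
        earliest = next_free
        if dep is not None:
            # The result of dep can be used `latency` cycles after dep was issued.
            earliest = max(earliest, issue[dep] + latency)
        issue[name] = earliest
        next_free = earliest + 1
    return issue


def stall_count(ops, latency):
    """Number of issue slots left empty before the last operation is issued."""
    issue = issue_cycles(ops, latency)
    return max(issue.values()) + 1 - len(ops)


def interleaved_rows(n):
    """First products for n block rows, followed by the n dependent accumulations."""
    firsts = [(f"A{i}1*vb11", None) for i in range(1, n + 1)]
    seconds = [(f"A{i}2*vb21+tmp{i}", f"A{i}1*vb11") for i in range(1, n + 1)]
    return firsts + seconds


# Each accumulation directly follows the product it depends on.
naive = [("A11*vb11", None), ("A12*vb21+tmp1", "A11*vb11"),
         ("A21*vb11", None), ("A22*vb21+tmp2", "A21*vb11")]
# The independent product A21*vb11 fills the free cycle.
interleaved = [naive[0], naive[2], naive[1], naive[3]]

naive_stalls = stall_count(naive, latency=2)
interleaved_stalls = stall_count(interleaved, latency=2)
deep_pipeline_stalls = stall_count(interleaved_rows(4), latency=4)
```

Under this model, a longer result latency can be covered by issuing the products of more third submatrices (A21, A31, ...) before the first dependent operation.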
  • FIG. 7 shows a fourth example of pipeline processing by the arithmetic unit 300 according to the present embodiment.
  • The calculation control unit 350 may perform control so that the calculations of cycles 4 to 5 in FIG. are performed before the calculations of cycles 2 to 3.
  • While the pipeline operation of the first operation, which calculates the matrix vector product of the first submatrix A11 and the first partial vector vb11, is being executed, the operation control unit 350 causes the pipeline calculation unit 330 to execute the calculation of cycle 1 in FIG. 7, which is another matrix vector product using the first partial vector, and the calculation of cycle 2 in FIG. 7, which is another matrix vector product using the first submatrix.
  • The calculation control unit 350 may also execute the calculation of cycle 3, which is the calculation of the matrix vector product of the third submatrix A21 used for the calculation of cycle 1 and the second partial vector vb12 used for the calculation of cycle 2, between the first operation and the second operation. As a result, the calculation control unit 350 can further fill the empty cycle between the first calculation and the second calculation.
  • the execution order of the operations in cycles 0 to 3 may be arbitrary, and the execution order of the operations in cycles 4 to 7 may be determined according to the execution order of the corresponding operations in cycles 0 to 3.
  • the pipeline processing of FIG. 7 is substantially the same as that in which the operations of cycles 1 and 2 in the pipeline processing of FIG. 5 are exchanged and the operations of cycles 5 and 6 are exchanged.
  • The calculation control unit 350 may instruct the memory control unit 370 to transfer the partial vectors and submatrices used by the pipeline calculation unit 330 from the main memory 360 to the vector storage unit 310 and the matrix storage unit 320 before the pipeline calculation unit 330 requires them.
  • The main memory 360 may transfer the partial vectors vb11, vb12, vb21, and vb22 to the vector storage unit 310 and the submatrices A11 and A12 to the matrix storage unit 320 before cycle 0.
  • Alternatively, the main memory 360 may transfer the partial vectors vb11 and vb12 to the vector storage unit 310 and the submatrix A11 to the matrix storage unit 320 before cycle 0, and may transfer the partial vectors vb21 and vb22 to the vector storage unit 310 and the submatrix A12 to the matrix storage unit 320 before cycle 2.
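The two transfer timings described above can be derived from the cycle in which each operand is first needed. The Python sketch below is a hedged illustration rather than the behavior of the memory control unit 370: the function names, the `batch_cycles` parameter, and the four-cycle schedule are hypothetical, with the schedule chosen only so that it is consistent with the transfer times stated in the text.

```python
def first_use(schedule):
    """Map each operand name to the first cycle in which it is used."""
    first = {}
    for cycle, operands in enumerate(schedule):
        for name in operands:
            first.setdefault(name, cycle)
    return first


def transfer_batches(schedule, batch_cycles):
    """Group operands by the cycle before which they must have been transferred.

    Operands first used within the same window of `batch_cycles` cycles are
    transferred together before the first cycle of that window.
    """
    batches = {}
    for name, cycle in first_use(schedule).items():
        deadline = (cycle // batch_cycles) * batch_cycles
        batches.setdefault(deadline, []).append(name)
    return batches


# Hypothetical schedule: (submatrix, partial vector) used in cycles 0 to 3.
schedule = [("A11", "vb11"), ("A11", "vb12"), ("A12", "vb21"), ("A12", "vb22")]

all_before_cycle_0 = transfer_batches(schedule, batch_cycles=4)
split_transfers = transfer_batches(schedule, batch_cycles=2)
```

With `batch_cycles=4` everything is transferred before cycle 0. With `batch_cycles=2` the operands of cycles 2 and 3 are deferred until just before cycle 2, which allows smaller storage units at the cost of a second transfer.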
  • The pipeline calculation unit 330 uses a different partial vector (vb11, vb12, vb21, or vb22) in each cycle, but needs a new submatrix (A11, A12, A21, or A22) only once every two cycles. Therefore, the matrix storage unit 320 only needs to have a throughput capable of outputting a submatrix once every two cycles, and the power consumption and the circuit scale of the matrix storage unit 320 can be reduced.
  • The pipeline calculation unit 330 uses a different submatrix (A11, A12, A21, or A22) in each cycle, but needs a new partial vector (vb11, vb12, vb21, or vb22) only once every two cycles. Therefore, the vector storage unit 310 only needs to have a throughput capable of outputting a partial vector once every two cycles, and the power consumption and the circuit scale of the vector storage unit 310 can be reduced.
  • The designer of the arithmetic unit 300, or a user who uses the arithmetic unit 300, may select which pipeline processing to execute so that the circuit scale of the arithmetic unit 300 or the power consumption of the arithmetic unit 300 becomes smaller.
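The trade-off between the two orderings can be made concrete by counting how often each storage unit has to output a new operand. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, a storage unit is assumed to need a new read only when its operand differs from the one used in the previous cycle, and the eight-cycle orders are reconstructed from the description (the second order is obtained by exchanging cycles 1 and 2 and cycles 5 and 6 of the first), not copied from the figures.

```python
def fetch_count(operands):
    """Count the cycles in which the operand differs from the previous cycle's."""
    count = 0
    previous = None
    for name in operands:
        if name != previous:
            count += 1
        previous = name
    return count


def swap_cycles(order, pairs):
    """Return a copy of order with the given pairs of cycle positions exchanged."""
    result = list(order)
    for i, j in pairs:
        result[i], result[j] = result[j], result[i]
    return result


# Assumed order in which each submatrix is reused in two consecutive cycles.
matrix_reuse_order = [("A11", "vb11"), ("A11", "vb12"), ("A21", "vb11"), ("A21", "vb12"),
                      ("A12", "vb21"), ("A12", "vb22"), ("A22", "vb21"), ("A22", "vb22")]
# Exchanging cycles 1 and 2 and cycles 5 and 6 reuses each partial vector instead.
vector_reuse_order = swap_cycles(matrix_reuse_order, [(1, 2), (5, 6)])

matrix_reads_a = fetch_count([m for m, _ in matrix_reuse_order])
vector_reads_a = fetch_count([v for _, v in matrix_reuse_order])
matrix_reads_b = fetch_count([m for m, _ in vector_reuse_order])
vector_reads_b = fetch_count([v for _, v in vector_reuse_order])
```

In this model the first order halves the number of submatrix reads and the second halves the number of partial vector reads, so the choice depends on which storage unit is more expensive to provision.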
  • Various embodiments of the present invention may be described with reference to flowcharts and block diagrams, in which a block may represent (1) a stage of a process in which an operation is performed or (2) a section of a device having a role of performing the operation. Specific stages and sections may be implemented by a dedicated circuit, a programmable circuit supplied with computer-readable instructions stored on a computer-readable medium, or a processor supplied with computer-readable instructions stored on a computer-readable medium.
  • Dedicated circuits may include either digital or analog hardware circuits, and may include either integrated circuits (ICs) or discrete circuits.
  • Programmable circuits may include reconfigurable hardware circuits, including logical AND, logical OR, logical XOR, logical NAND, logical NOR, and other logical operations, as well as memory elements such as flip-flops, registers, field programmable gate arrays (FPGA), and programmable logic arrays (PLA).
  • The computer-readable medium may include any tangible device capable of storing instructions to be executed by an appropriate device. As a result, a computer-readable medium having instructions stored therein constitutes a product containing instructions that can be executed to create means for performing the operations specified in a flowchart or block diagram. Examples of computer-readable media may include electronic storage media, magnetic storage media, optical storage media, electromagnetic storage media, semiconductor storage media, and the like.
  • More specific examples of computer-readable media may include floppy (registered trademark) disks, diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), electrically erasable programmable read-only memory (EEPROM), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), Blu-ray (RTM) discs, memory sticks, integrated circuit cards, and the like.
  • Computer-readable instructions may include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, JAVA®, and C++, and traditional procedural programming languages such as the "C" programming language or similar programming languages.
  • Computer-readable instructions may be provided to a processor or programmable circuit of a general purpose computer, special purpose computer, or other programmable data processing device, either locally or via a local area network (LAN) or a wide area network (WAN) such as the Internet, and may be executed to create means for performing the operations specified in the flowchart or block diagram.
  • Examples of processors include computer processors, processing units, microprocessors, digital signal processors, controllers, microcontrollers, and the like.
  • FIG. 8 shows an example of a computer 2200 in which a plurality of aspects of the present invention may be embodied in whole or in part.
  • A program installed on the computer 2200 may cause the computer 2200 to function as an operation associated with the device according to an embodiment of the present invention or as one or more sections of the device, or to execute the operation or the one or more sections. The program may also cause the computer 2200 to execute a process according to an embodiment of the present invention or a stage of the process.
  • Such a program may be run by the CPU 2212 to cause the computer 2200 to perform certain operations associated with some or all of the blocks in the flowcharts and block diagrams described herein.
  • the computer 2200 includes a CPU 2212, a RAM 2214, a graphic controller 2216, and a display device 2218, which are connected to each other by a host controller 2210.
  • the computer 2200 also includes input / output units such as a communication interface 2222, a hard disk drive 2224, a DVD-ROM drive 2226, and an IC card drive, which are connected to the host controller 2210 via an input / output controller 2220.
  • the computer also includes legacy I / O units such as the ROM 2230 and keyboard 2242, which are connected to the I / O controller 2220 via an I / O chip 2240.
  • the CPU 2212 operates according to the programs stored in the ROM 2230 and the RAM 2214, thereby controlling each unit.
  • The graphic controller 2216 acquires the image data that the CPU 2212 generates in a frame buffer or the like provided in the RAM 2214 or in the graphic controller 2216 itself, and causes the image data to be displayed on the display device 2218.
  • the communication interface 2222 communicates with other electronic devices via the network.
  • the hard disk drive 2224 stores programs and data used by the CPU 2212 in the computer 2200.
  • the DVD-ROM drive 2226 reads the program or data from the DVD-ROM 2201 and provides the program or data to the hard disk drive 2224 via the RAM 2214.
  • the IC card drive reads the program and data from the IC card and writes the program and data to the IC card.
  • the ROM 2230 stores either a boot program or the like executed by the computer 2200 at the time of activation, or a program that depends on the hardware of the computer 2200.
  • the input / output chip 2240 may also connect various input / output units to the input / output controller 2220 via a parallel port, a serial port, a keyboard port, a mouse port, and the like.
  • the program is provided by a computer-readable medium such as a DVD-ROM 2201 or an IC card.
  • the program is read from a computer-readable medium, installed on a hard disk drive 2224, RAM 2214, or ROM 2230, which is also an example of a computer-readable medium, and executed by the CPU 2212.
  • the information processing described in these programs is read by the computer 2200 and provides a link between the program and the various types of hardware resources described above.
  • the device or method may be configured to perform manipulation or processing of information in accordance with the use of computer 2200.
  • The CPU 2212 may execute a communication program loaded in the RAM 2214 and instruct the communication interface 2222 to perform communication processing based on the processing described in the communication program.
  • The communication interface 2222 reads transmission data stored in a transmission buffer processing area provided in a recording medium such as the RAM 2214, the hard disk drive 2224, the DVD-ROM 2201, or an IC card, and transmits the read transmission data to the network, or writes received data received from the network to a reception buffer processing area or the like provided on the recording medium.
  • The CPU 2212 may cause the RAM 2214 to read all or the necessary parts of a file or database stored in an external recording medium such as the hard disk drive 2224, the DVD-ROM drive 2226 (DVD-ROM 2201), or an IC card, and may perform various types of processing on the data in the RAM 2214. The CPU 2212 then writes the processed data back to the external recording medium.
  • The CPU 2212 may perform, on the data read from the RAM 2214, various types of processing that are described in various parts of the present disclosure and specified by the instruction sequence of the program, including various types of operations, information processing, conditional judgment, conditional branching, unconditional branching, and information retrieval and replacement, and writes the results back to the RAM 2214. Further, the CPU 2212 may search for information in a file, a database, or the like in the recording medium.
  • The CPU 2212 may search the plurality of entries for an entry that matches a condition specifying the attribute value of the first attribute, read the attribute value of the second attribute stored in that entry, and thereby acquire the attribute value of the second attribute associated with the first attribute that satisfies the predetermined condition.
  • the program or software module described above may be stored on a computer 2200 or on a computer-readable medium near the computer 2200.
  • A recording medium such as a hard disk or RAM provided in a server system connected to a dedicated communication network or the Internet can be used as a computer-readable medium, thereby providing the program to the computer 2200 over the network.
  • 300 Arithmetic unit
  • 310 Vector storage unit
  • 320 Matrix storage unit
  • 330 Pipeline calculation unit
  • 340 Result storage unit
  • 350 Arithmetic control unit
  • 360 Main memory
  • 370 Memory control unit
  • 2200 Computer
  • 2201 DVD-ROM
  • 2210 Host controller
  • 2212 CPU
  • 2214 RAM
  • 2216 Graphic controller
  • 2218 Display device
  • 2220 I/O controller
  • 2222 Communication interface

Abstract

The invention relates to an arithmetic unit comprising: a vector storage unit that stores at least one first partial vector among a plurality of first partial vectors obtained by dividing a first vector; a matrix storage unit that stores, among a plurality of first submatrices obtained by dividing, in the row and column directions, a first matrix to be multiplied by the first vector, at least one first submatrix to be multiplied by the first partial vector; a pipeline calculation unit that executes, by pipeline operation, a calculation that adds an intermediate vector to the matrix vector product of the submatrix stored in the matrix storage unit and the partial vector stored in the vector storage unit; and a calculation control unit that, while the pipeline calculation unit is executing the pipeline operation of the matrix vector product of the first submatrix and the first partial vector, instructs the pipeline calculation unit to execute the calculation of another matrix vector product using the first partial vector or the first submatrix.
PCT/JP2020/022377 2019-06-07 2020-06-05 Arithmetic unit, arithmetic method, and arithmetic program WO2020246598A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019-107283 2019-06-07
JP2019107283A JP7241397B2 (ja) 2019-06-07 2019-06-07 演算装置、演算方法、および演算プログラム

Publications (1)

Publication Number Publication Date
WO2020246598A1 true WO2020246598A1 (fr) 2020-12-10

Family

ID=73652229

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/022377 WO2020246598A1 (fr) 2019-06-07 2020-06-05 Arithmetic unit, arithmetic method, and arithmetic program

Country Status (2)

Country Link
JP (1) JP7241397B2 (fr)
WO (1) WO2020246598A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02266458A (ja) * 1989-04-06 1990-10-31 Nec Corp ニューラルネットワークシミュレーション装置
JPH0644196A (ja) * 1992-07-24 1994-02-18 Toshiba Corp 並列計算機用マイクロプロセッサ

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6044196B2 (ja) 2012-09-04 2016-12-14 リコーイメージング株式会社 撮影レンズ制御装置

Also Published As

Publication number Publication date
JP7241397B2 (ja) 2023-03-17
JP2020201659A (ja) 2020-12-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20819387

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20819387

Country of ref document: EP

Kind code of ref document: A1