WO2019128404A1 - Matrix multiplier - Google Patents

Matrix multiplier

Info

Publication number
WO2019128404A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
memory
row
column
sub
Prior art date
Application number
PCT/CN2018/111077
Other languages
English (en)
French (fr)
Inventor
刘虎
廖恒
屠嘉晋
袁宏辉
林灏勋
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to KR1020207021471A priority Critical patent/KR102443546B1/ko
Priority to KR1020227031367A priority patent/KR102492477B1/ko
Priority to EP18895760.9A priority patent/EP3726399A4/en
Priority to JP2020536531A priority patent/JP6977239B2/ja
Publication of WO2019128404A1 publication Critical patent/WO2019128404A1/zh
Priority to US16/915,915 priority patent/US11334648B2/en
Priority to US17/725,492 priority patent/US11934481B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • G06F7/53Multiplying only in parallel-parallel fashion, i.e. both operands being entered in parallel
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products

Definitions

  • the present invention relates to the field of computing technologies, and in particular, to a matrix multiplier.
  • Method 1: calculation by a vector processor.
  • Assuming C = A*B and that the vector processor can compute M elements at a time, the vector processor loads the i-th row vector of matrix A (including elements A_i1, A_i2, ..., A_i(M-1), A_iM) into a source register Reg0, then loads the j-th column vector of matrix B (including elements B_j1, B_j2, ..., B_j(M-1), B_jM) into a register Reg1, so that multiplication between the corresponding elements of Reg0 and Reg1 can be implemented; finally the accumulation is completed by an addition tree, the data C_ij in the i-th row and j-th column of matrix C is calculated, and matrix C can be obtained by performing this calculation multiple times.
  • Method 2: to further increase the calculation speed, the matrix multiplication can be completed by a two-dimensional computing array.
  • For example, the two-dimensional computing array can be an N*N systolic array.
  • In method 1, completing the multiplication of two N*N matrices requires N^3 multiplication operations; since the vector processor can compute multiplications between M elements per clock cycle, the time required to complete the operation is N^3/M clock cycles, as the sketch below illustrates.
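As an illustration only (not part of the patent text), here is a minimal Python sketch of method 1: each output element C_ij costs one M-lane elementwise multiply plus an addition-tree reduction. The helper name `vector_mac` is hypothetical.

```python
# Minimal sketch of method 1: a vector processor computing C = A * B.
# Reg0/Reg1 model the two source registers; sum() models the addition tree.
def vector_mac(row, col):
    # One "step": parallel elementwise multiplies, then a tree reduction.
    return sum(a * b for a, b in zip(row, col))

def matmul_method1(A, B):
    M, K, N = len(A), len(B), len(B[0])
    # Each C[i][j] needs one row of A (Reg0) and one column of B (Reg1).
    return [[vector_mac(A[i], [B[k][j] for k in range(K)])
             for j in range(N)] for i in range(M)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_method1(A, B) == [[19, 22], [43, 50]]
```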
  • the technical problem to be solved by the embodiments of the present invention is to provide a matrix multiplier and related equipment to solve the problem of inflexible calculation and low efficiency in matrix multiplication.
  • an embodiment of the present invention provides a matrix multiplier, which may include:
  • a first memory configured to store a first matrix, where the first matrix is an M*K matrix
  • a second memory configured to store a second matrix, wherein the second matrix is a K*N matrix
  • an arithmetic circuit connected to the first memory and the second memory, the arithmetic circuit comprising X rows * Y columns of operation units, each operation unit including a vector multiplication circuit and an addition circuit; the vector multiplication circuit is configured to receive row vector data sent by the first memory and column vector data sent by the second memory and to multiply the two vectors, and the addition circuit is configured to add the results of multiplying the two vectors and to accumulate the calculation results belonging to the same operation unit to obtain the operation result of each operation unit;
  • a controller coupled to the arithmetic circuit, the controller being configured to perform the following actions:
  • the controller is also configured to perform the following actions:
  • a matrix multiplier is provided which uses a controller to complete a blocked matrix multiplication method, namely MNK fractal: the control logic of the internal controller 604 in the matrix multiplier 60
  • splits a large matrix into unit matrix multiplications (that is, X*L by L*Y matrices).
  • the control logic of the controller 604 sends a unit matrix multiplication task to the arithmetic circuit 603 in every clock cycle, so that data are pipelined and the X rows * Y columns of operation units operate at full capacity (see the loop sketch below).
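A schematic Python sketch of the MNK fractal idea just described: the large product is decomposed into S*R*T unit multiplications of X*L by L*Y sub-blocks, one issued per "clock cycle". The loop order shown is only one of the orders the controller may choose (manner 1 or manner 2 below); the function names are hypothetical.

```python
# Sketch: MNK fractal -- decompose an (S*X x R*L) by (R*L x T*Y) product
# into S*R*T unit multiplications of X*L by L*Y sub-blocks.
def mnk_fractal(A_blocks, B_blocks, S, R, T):
    # C_blocks[s][t] accumulates the sum over r of A[s][r] * B[r][t].
    C_blocks = [[None] * T for _ in range(S)]
    for s in range(S):
        for t in range(T):
            for r in range(R):          # one unit multiplication per cycle
                part = block_matmul(A_blocks[s][r], B_blocks[r][t])
                C_blocks[s][t] = part if C_blocks[s][t] is None else \
                    block_add(C_blocks[s][t], part)
    return C_blocks

def block_matmul(a, b):                  # (X*L) times (L*Y)
    return [[sum(a[x][l] * b[l][y] for l in range(len(b)))
             for y in range(len(b[0]))] for x in range(len(a))]

def block_add(c, d):
    return [[ci + di for ci, di in zip(cr, dr)] for cr, dr in zip(c, d)]
```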
  • the matrix multiplier provided by the embodiment of the present invention can perform convolution operations and FC operations in a convolutional neural network.
  • the controller is specifically configured to perform the following actions:
  • the controller is further configured to control the row vectors of the sub-block A_sr to enter, in ascending order of the row number x, the x-th row corresponding to the X row * Y column operation units, with a time difference of 1 clock cycle between adjacent row vectors entering operation units in different rows of the same column; the controller is further configured to simultaneously control the column vectors of the corresponding sub-block B_rt to enter, in ascending order of the column number y, the y-th column corresponding to the X row * Y column operation units, with a time difference of 1 clock cycle between adjacent column vectors entering operation units in different columns of the same row.
  • the controller is further configured to:
  • the values of s and r remain unchanged while the value of t varies, so that the first memory reuses the same sub-block A_sr over the at least two consecutive sub-block multiplication calculation periods;
  • the sub-block multiplication calculation period is the time taken by the X row * Y column operation units to complete the matrix multiplication of one sub-block A_sr with the corresponding sub-block B_rt.
  • the matrix multiplier further includes a third memory connected to the operation circuit;
  • the controller is configured to control the X row * Y column operation units to store the calculation results of the vector multiplication circuits and the addition circuits into the third memory.
  • the matrix multiplier further includes a fourth memory connected to the first memory and the second memory, and a fifth memory connected to the third memory;
  • the controller is further configured to control, before the multiplication of the first matrix and the second matrix is calculated: moving the data sources of the first matrix and the second matrix from the fourth memory to the first memory and the second memory respectively, and moving the calculation results from the third memory to the fifth memory.
  • the vector multiplication circuit includes L multipliers, and the addition circuit includes an addition tree with (L+1) inputs.
  • the first memory, the second memory, the arithmetic circuit, and the controller are connected by a bus interface unit.
  • the matrix multiplier further includes a data handling unit, where the data handling unit is configured to perform a matrix transposition operation on the first matrix before moving the first matrix to the first memory, or to perform a matrix transposition operation on the second matrix before moving the second matrix to the second memory.
  • the controller controls any sub-block of the first matrix to be stored in the first memory in row form, or controls any sub-block of the second matrix to be stored in the second memory in row form, so that the sub-block can be read out quickly and can be transposed flexibly and quickly.
  • according to a second aspect, the application provides an electronic device, which may include:
  • the security element provided by any one of the implementations of the first aspect above and a discrete device coupled to the chip.
  • according to a third aspect, the present application provides a system-on-chip, which includes the chip provided by any one of the implementations of the first aspect above.
  • the system-on-chip may consist of the chip alone, or may include the chip and other discrete devices.
  • FIG. 1 is a schematic diagram of a process of calculating two matrix products in the prior art
  • FIG. 2 is a schematic diagram of converting a convolution kernel into a weight matrix in the prior art
  • FIG. 3 is a schematic diagram of converting input data into an input matrix in the prior art
  • FIG. 4 is a schematic diagram of a method for multiplying two matrices in the prior art
  • FIG. 5 is a schematic diagram of a TPU systolic array in the prior art
  • FIG. 6 is a structural diagram of a matrix multiplication accelerator according to an embodiment of the present invention.
  • FIG. 7 is a structural diagram of an operation unit 6030 according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a matrix partitioning according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a wiring in a specific operation circuit 603 according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of wiring in another specific operation circuit 603 according to an embodiment of the present invention.
  • FIG. 11 shows an input format of a matrix multiplier whose base is 4 according to an embodiment of the present invention.
  • FIG. 12 to FIG. 15 are schematic diagrams of the pipelined execution of an M=2, N=2, K=2 matrix multiplier at times T=0, T=1, T=7, and T=11, respectively, according to an embodiment of the present invention.
  • FIG. 16 is a schematic structural diagram of another matrix multiplier provided by an embodiment of the present invention.
  • FIG. 17 is a schematic structural diagram of still another matrix multiplier provided by an embodiment of the present invention.
  • FIG. 18 is a schematic diagram of an asynchronous execution sequence of instructions according to an embodiment of the present invention.
  • references to "an embodiment” herein mean that a particular feature, structure, or characteristic described in connection with the embodiments can be included in at least one embodiment of the present application.
  • the appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to independent or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
  • a component can be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a computing device and a computing device can be a component.
  • One or more components can reside within a process and/or execution thread, and the components can be located on one computer and/or distributed between two or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • a component may, for example, communicate through local and/or remote processes based on signals having one or more data packets (e.g., data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet interacting with other systems by means of the signal).
  • Convolutional neural networks mainly include convolution and fully connected (FC) operations, in which the computational complexity of convolution operations can usually account for more than 70% of the entire network operation.
  • Convolution operations are not strictly equivalent to matrix multiplication operations, but convolution operations can be converted to matrix multiplication operations through reasonable data adjustment.
  • in convolutional neural networks, there are usually multiple convolution kernels.
  • the convolution kernel is three-dimensional and contains data in three dimensions: the x and y directions are the length and width of the data, and the z direction can be considered the depth of the data.
  • a convolution kernel is actually a filter, mainly used to extract different features from an image. Referring to FIG. 2, a convolution kernel is essentially a combination of a series of weights. Assuming the number of convolution kernels is K, extracting the N elements at the same position in the z direction from the K convolution kernels yields an N*K weight matrix.
  • the convolution kernel can be pre-stored in the memory of the matrix multiplier in the form of a weight matrix.
  • "*" in the embodiment of the present invention means "multiplied by”.
  • based on the stride of the convolution kernel (stride 1 in this embodiment of the present invention), the matrix multiplier can extract the N data in the z direction of M input points, M*N data in total.
  • an input matrix can thus be formed, and the matrix multiplier needs to multiply the input matrix by the weight matrix, as sketched below.
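To make the conversion concrete, here is a small, hedged Python sketch of a standard im2col-style rearrangement consistent with the passage above (stride 1, no padding). The layout chosen (one row per output point, one column per kernel weight) is illustrative, not a format mandated by the patent.

```python
# Sketch: turn an HxWxC input into an (M x N) input matrix for a stride-1
# convolution with a kxk kernel: M = number of output points, N = k*k*C
# weights per kernel, so convolution becomes input_matrix @ weight_matrix.
def im2col(inp, k):
    H, W, C = len(inp), len(inp[0]), len(inp[0][0])
    rows = []
    for i in range(H - k + 1):
        for j in range(W - k + 1):       # one output point per row
            rows.append([inp[i + di][j + dj][c]
                         for di in range(k) for dj in range(k)
                         for c in range(C)])
    return rows

inp = [[[1], [2]], [[3], [4]]]           # 2x2 image, C=1
assert im2col(inp, 2) == [[1, 2, 3, 4]]  # single 2x2 window, flattened
```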
  • the FC operation is essentially a multiplication of a vector and a matrix.
  • the input of the FC operation is a vector of 9216 elements, and FC needs to output 4096 points; then, to obtain one output point of FC, a dot product of the 9216-element vector and 9216 weights is required, and to obtain all 4096 points, dot products of the 9216-element vector with 9216x4096 weights are required.
  • FIG. 4 shows the calculation formula of the matrix C = A*B, where A is a matrix of size M*K and B is a matrix of size K*N; in this embodiment of the present invention, M, N, and K are all positive integers.
  • to obtain one data element of matrix C, the data of a row vector of matrix A must be dot-multiplied with the corresponding data of a column vector of matrix B and then accumulated, i.e., obtaining one element of matrix C requires K multiplications, so computing the whole matrix C requires M*N*K multiplications.
  • in the prior art, a systolic array calculation method is used: for example, Google's custom ASIC for machine learning, the Google TPUv1, uses a systolic array design to implement a 256×256 2-D MAC array to optimize matrix multiplication and convolution operations (as shown in FIG. 5). Each cell in the figure is a multiplier; after a multiplier multiplies one element from each of the two matrices, the calculated result (the partial sum, i.e., an intermediate result of the matrix multiplication) is passed down to the accumulation unit below in the figure and accumulated with the previous accumulated value. Thus, when the data is running at full load, the systolic array accumulates one matrix-size batch of intermediate values per clock cycle.
  • in the above solution, matrix multiplication efficiency is low because the computation density is low.
  • furthermore, during convolution, the calculation size of the systolic array is relatively fixed; to realize the systolic array's computational efficiency, the inputs and weights require many transformations of form, which makes the operation inflexible. In addition, when doing matrix multiplication, a large data size is needed to achieve pipelined execution; for example, a 256x256 2-D systolic array is not efficient on small matrices.
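For intuition only, a tiny Python sketch of the partial-sum flow in one column of such a 2-D MAC array: each cell multiplies its stationary weight by the incoming activation and passes the accumulated partial sum downward, in the TPUv1-style design sketched above. The cell model here is simplified and hypothetical.

```python
# Sketch: one column of a weight-stationary systolic array.
# Each cell holds a weight; activations stream in, partial sums flow down.
def systolic_column(weights, activations):
    partial = 0
    for w, a in zip(weights, activations):   # cell i: partial += w_i * a_i
        partial = partial + w * a            # passed to the cell below
    return partial                           # reaches the bottom accumulator

assert systolic_column([1, 2, 3], [4, 5, 6]) == 1*4 + 2*5 + 3*6
```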
  • in addition, a related patent implements an M*K*N 3-D MAC array, which, compared with the 2-D MAC array solutions of TPUv1 and NVDLA, further significantly improves the matrix multiplication calculation efficiency.
  • that invention proposes a new hardware accelerator architecture that completes an [NxN] matrix multiplication operation in a single clock cycle.
  • the number of processing elements (PEs) included is NxNxN, and the number of addition trees included is NxN.
  • a calculation method for splitting a large matrix into smaller matrices is also proposed.
  • however, the above solution needs to pad the matrix size up to the size supported by the hardware, which wastes data bandwidth and lowers calculation efficiency. If the matrix is manually split into large and small matrices, software programming becomes complicated and the amount of software programming increases greatly.
  • moreover, because the accelerator can only load matrix elements in a one-way cycle, the software needs to split the matrix by itself, so the calculation mode is single and inflexible. In addition, once the memories for matrix A and matrix B cannot hold all the data, repeated reads occur; therefore the buffer size depends strongly on the business algorithm, i.e., the accelerator depends heavily on tightly coupled on-chip memory.
  • the technical problem to be solved by the present application is therefore how to use hardware to compute the large volume of data operations in a convolutional neural network efficiently, flexibly, and with low energy consumption.
  • it can be understood that the matrix multiplier provided by the embodiments of the present invention can be applied not only to machine learning, deep learning, and convolutional neural networks, but also to digital image processing and digital signal processing, as well as to other fields involving matrix multiplication operations.
  • FIG. 6 is a structural diagram of a matrix multiplication accelerator according to an embodiment of the present invention.
  • the matrix multiplier 60 includes a first memory 601, a second memory 602, an arithmetic circuit 603, and a controller 604, where the arithmetic circuit 603 can communicate data with the first memory 601, the second memory 602, and the controller 604 through a bus.
  • the arithmetic circuit 603 is configured to extract matrix data from the first memory 601 and the second memory 602 and perform vector multiplication and addition operations, and the controller 604 is configured to control the arithmetic circuit 603 to complete vector operations according to a preset program or instructions. Among them:
  • the first memory 601 is for storing the first matrix.
  • the first memory 601 mentioned in this embodiment of the present invention, and the second memory 602, the third memory 605, and other internal memories of the matrix multiplier mentioned below, may each be a register, a random access memory (RAM), a static random access memory, a flash memory, or another readable and writable memory.
  • in this application, the data types of the first matrix, the second matrix, and the operation results may all be of types such as int8, fp16, or fp32.
  • M, K, and N, as well as X and Y, are integers greater than 0; any two of M, N, and K may be equal or unequal, M, N, and K may all be equal or unequal, and X and Y may be equal or unequal; this application does not specifically limit this.
  • the arithmetic circuit 603 may include X rows * Y columns of operation units 6030 (each may be called a multiply-accumulate unit, MAC), and each operation unit can perform vector multiplication independently.
  • FIG. 6 is drawn with the arithmetic circuit 603 including 4*4 operation units 6030 as an example, i.e., X=4 and Y=4.
  • the operation unit 6030 has two inputs, used respectively to receive a row vector and a column vector sent by the first memory 601 and the second memory 602, and performs a vector multiplication on the row vector and the column vector.
  • specifically, an operation unit 6030 includes a vector multiplication circuit and an addition circuit; the vector multiplication circuit is configured to receive the row vector data sent by the first memory 601 and the column vector data sent by the second memory 602 and to multiply the two vectors, and the addition circuit is configured to add the results of multiplying the two vectors and to accumulate the calculation results belonging to the same operation unit to obtain the calculation result of each operation unit.
  • optionally, the matrix multiplier 60 further includes a third memory 605 for storing the operation results of the vector multiplication circuits and the addition circuits, possibly from different clock cycles.
  • the third memory 605 in the present application may include X*Y storage units, and each storage unit is configured to store each operation result of the corresponding operation unit.
  • each of the arithmetic units corresponds to a designated storage space in the third memory 605 for storing the result of each operation.
  • the controller 604 can calculate the product of the first matrix and the second matrix by performing the following actions:
  • the controller 604 partitions the first matrix in units of sub-blocks of size X*L to obtain S×R sub-blocks of the same size, where the sub-block in the s-th row and r-th column of the S×R sub-blocks is denoted A_sr;
  • that is, the first matrix is divided in units of X*L sub-blocks.
  • the purpose of the partitioning is to split a large matrix into many small matrices that fit the size of the matrix multiplier, then compute the small matrices in a certain order and accumulate the values of the related small matrices, finally obtaining the matrix multiplication result. This not only allows flexible calculation and facilitates subsequent reuse and multi-level caching, but also further improves calculation efficiency and reduces data movement bandwidth and energy consumption.
  • it should be noted that the first matrix is an M*K matrix, and the first matrix may not be exactly divisible into an integer number of X*L sub-blocks. Therefore, when M/X or K/L is not an integer, the operation can be performed by padding with 0 elements, or the corresponding positions do not participate in the calculation and the results are assigned 0. Specifically, S = ⌈M/X⌉ and R = ⌈K/L⌉; when M%X ≠ 0, rows (M+1) through S*X of the first matrix are not calculated and the results are assigned 0; when K%L ≠ 0, columns (K+1) through R*L of the first matrix are not calculated and the results are assigned 0.
  • that is, on the corresponding rows and columns, no substantive multiplication is performed by the operation units; they are treated as having been calculated with result 0, which saves the read and computation power of the corresponding operation units.
  • correspondingly, the controller 604 further controls the second matrix to be partitioned in units of sub-blocks of size L*Y to obtain R×T sub-blocks of the same size, where the sub-block in the r-th row and t-th column of the R×T sub-blocks is denoted B_rt.
  • it should be noted that the second matrix is a K*N matrix and may not be exactly divisible into an integer number of L*Y sub-blocks. Therefore, when K/L or N/Y is not an integer, the operation can likewise be performed by padding with 0 elements, or the corresponding positions do not participate in the calculation and the results are assigned 0. Specifically, R = ⌈K/L⌉ and T = ⌈N/Y⌉; when K%L ≠ 0, rows (K+1) through R*L of the second matrix are not calculated and the results are assigned 0; when N%Y ≠ 0, columns (N+1) through T*Y of the second matrix are not calculated and the results are assigned 0.
  • again, on the corresponding rows and columns, no substantive multiplication is performed by the operation units; they are treated as having been calculated with result 0, which saves the read and computation power of the corresponding operation units. The padding rule is sketched below.
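A small sketch of the zero-padding rule just described, under the reading that the first matrix is padded to S*X rows and R*L columns (an assumption consistent with the blocking; the published index expressions are garbled). The function name `pad_to_blocks` is hypothetical.

```python
import math

# Sketch: compute block counts and zero-pad a matrix to whole sub-blocks.
def pad_to_blocks(mat, rows_per_block, cols_per_block):
    m, k = len(mat), len(mat[0])
    S = math.ceil(m / rows_per_block)        # S = ceil(M / X)
    R = math.ceil(k / cols_per_block)        # R = ceil(K / L)
    padded = [row + [0] * (R * cols_per_block - k) for row in mat]
    padded += [[0] * (R * cols_per_block)
               for _ in range(S * rows_per_block - m)]
    return padded, S, R

mat = [[1] * 6 for _ in range(12)]           # M=12, K=6
padded, S, R = pad_to_blocks(mat, 4, 3)      # X=4, L=3
assert (S, R) == (3, 2) and len(padded) == 12 and len(padded[0]) == 6
```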
  • operations may be performed successively in order of the value of s in sub-block A_sr and the corresponding B_rt, or in order of the value of t.
  • as shown in FIG. 8, for example, the first matrix is an M*K matrix and the second matrix is a K*N matrix, with M=12, K=6, N=12, X=4, Y=4, L=3; after partitioning the first and second matrices, S=3, R=2, T=3.
  • each sub-block A_sr of the partitioned first matrix is an X*L (i.e., 4*3) matrix, and each element of the partitioned matrix B is actually an L*Y (i.e., 3*4) matrix.
  • in the multiplication of the first matrix and the second matrix, a sub-block matrix multiplication needs to be performed for every sub-block A_sr of the first matrix with the corresponding sub-block B_rt of the second matrix. There are multiple implementations of the order in which the sub-block multiplications are performed:
  • Manner 1: for example, sub-block A11 and sub-block B11 may be input first, and in the first sub-block multiplication calculation period (which can be understood as the first round) all row vectors of A11 are operated on with all column vectors of the corresponding B11.
  • in the second sub-block multiplication calculation period (which can be understood as the second round), all row vectors of A12 are operated on with all column vectors of the corresponding B21, so that through the accumulation in the operation units the value of the result point C11 in the first row and first column of the result matrix C is obtained.
  • by analogy, the result points of all positions in the result matrix C can be obtained.
  • in fact, C11 = A11*B11 + A12*B21, where A11*B11 and A12*B21 are each 4*4 matrices.
  • that is, C11 is actually a 4*4 matrix; therefore, according to matrix calculation rules, the finally obtained matrix C is the M*N result matrix, i.e., a 12*12 result matrix. The block identity can be checked numerically, as shown below.
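A short sketch checking the block identity C11 = A11*B11 + A12*B21 with the example dimensions X=4, L=3, Y=4, R=2; numpy is assumed available and random test data is used for illustration only.

```python
import numpy as np

# Sketch: verify the block rule C11 = A11*B11 + A12*B21 for X=4, L=3, Y=4.
rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(4, 6))          # one block-row of the first matrix
B = rng.integers(0, 5, size=(6, 4))          # one block-column of the second
A11, A12 = A[:, :3], A[:, 3:]                # two 4x3 sub-blocks
B11, B21 = B[:3, :], B[3:, :]                # two 3x4 sub-blocks
C11 = A11 @ B11 + A12 @ B21                  # accumulation over r = 1..R
assert np.array_equal(C11, A @ B)            # matches the unblocked product
```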
  • Manner 2: reusing one of the sub-blocks according to a certain rule.
  • specifically, this embodiment of the present invention provides a sub-block reuse manner for invoking one sub-block A_sr of the first matrix and the corresponding sub-blocks B_rt of the second matrix to perform the sub-block matrix multiplications.
  • the controller 604 is further configured to control: during at least two consecutive sub-block multiplication calculation periods, the values of s and r remain unchanged and the value of t changes, so that the first memory reuses the same A_sr within the at least two consecutive sub-block multiplication calculation periods, where the sub-block multiplication calculation period is the time taken by the X row * Y column operation units to complete the matrix multiplication of one sub-block A_sr with the corresponding sub-block B_rt.
  • for example, in the third sub-block multiplication calculation period (which can be understood as the third round), all row vectors of A11 are operated on with all column vectors of yet another corresponding sub-block B13.
  • in this way, A11 in the first memory can be reused over several adjacent sub-block multiplication calculation periods, saving read/write overhead and reducing data movement bandwidth.
  • in manner 1 and manner 2 above, the calculation rule for a sub-block A_sr of the first matrix and the corresponding sub-block B_rt of the second matrix within one sub-block multiplication calculation period is as follows:
  • the x-th of the X row vectors of any sub-block A_sr and the y-th of the Y column vectors of the corresponding sub-block B_rt are input into the operation unit in the x-th row and y-th column of the X row * Y column operation units for calculation,
  • where the value of r in the sub-block A_sr equals the value of r in the corresponding sub-block B_rt. That is, for A_sr and the corresponding sub-block B_rt of the second matrix, any one row vector and any one column vector are input into a designated operation unit among the X row * Y column operation units for calculation.
  • for example, the second row vector [a21 a22 a23] of A11 and the third column vector [b13 b23 b33]^T of the corresponding sub-block B11 of the second matrix are operated on in the operation unit in the second row and third column of the X row * Y column operation units, and so on.
  • FIG. 9 is a schematic diagram of a wiring in a specific operation circuit 603 according to an embodiment of the present invention, based on the arrangement of the operation units in the operation circuit 603 shown in FIG. 6 .
  • the BUFA is the first memory 601 of the first matrix
  • the BUFB is the second memory 602 of the second matrix
  • the BUFC is the third memory 605 storing the calculation result of each operation unit 6030
  • the size of A is (M*base)x(K*base)
  • the size of B is (K*base)x(N*base)
  • the size of C is (M*base)x(N*base).
  • Base is the basic size of the arithmetic circuit 603, that is, X*Y, for example, 8*8, 16*16, 32*32, and the like.
  • the arithmetic circuit 603, which in this embodiment may be called a fractal matrix multiplication unit, consists of a 3-D MAC array (MAC Cube) and accumulators, and executes fractal matrix multiplication instructions of the form C=A*B or C=A*B+C, where A/B/C are two-dimensional matrices; this calculation operation is called MNK matrix multiplication (and accumulation).
  • in actual execution, the MNK matrix multiplication is performed fractally: the controller controls the decomposition of the large matrix into base-size basic matrix multiplications, which are combined in a specific order (manner 1 or manner 2 mentioned above).
  • the MAC Group in FIG. 7 is an N*N (4*4) multiply-accumulator group consisting of N (4) multiplication units and one (N+1)-input (i.e., 5-input) accumulation tree.
  • at the matrix multiplication level, this multiply-accumulator can compute one row multiplied by one column and accumulated (i.e., one element of the result matrix).
  • the wiring shown in FIG. 9 allows the arithmetic circuit 603 to complete the matrix multiplication of one sub-block A_sr with the corresponding sub-block B_rt within one clock cycle: because all X row vectors of A_sr and all Y column vectors of the corresponding sub-block B_rt can reach the corresponding operation units 6030 from BUFA and BUFB simultaneously through the wiring of FIG. 9, the controller 604 can control the arithmetic circuit 603 to complete the multiplication of one sub-block A_sr and the corresponding sub-block B_rt in one clock cycle, and in the next clock cycle complete the matrix multiplication of another sub-block A_sr and its corresponding sub-block B_rt, or of the same sub-block A_sr and another corresponding sub-block B_rt. A behavioral model of one such cycle follows.
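A behavioral sketch of what the FIG. 9 wiring achieves in one cycle: every operation unit (x, y) receives row x of A_sr and column y of B_rt simultaneously and produces one L-element dot product. This is a software model under those assumptions, not the circuit itself.

```python
# Sketch: one "clock cycle" of the X*Y grid under the FIG. 9 broadcast wiring.
# Unit (x, y) gets row x of the A sub-block and column y of the B sub-block.
def grid_cycle(A_sr, B_rt, C_acc):
    X, L, Y = len(A_sr), len(B_rt), len(B_rt[0])
    for x in range(X):
        for y in range(Y):               # all X*Y units fire in parallel
            dot = sum(A_sr[x][l] * B_rt[l][y] for l in range(L))
            C_acc[x][y] += dot           # accumulated across sub-block periods
    return C_acc
```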
  • FIG. 10 is a schematic diagram of a wiring in another specific operation circuit 603 according to an embodiment of the present invention.
  • in the arithmetic circuit 603 corresponding to FIG. 10, a systolic array structure is provided. Specifically, the controller 604 is configured to control the row vectors of any sub-block A_sr to enter, in ascending order of the row number x, the x-th row corresponding to the X row * Y column operation units, with a time difference of 1 clock cycle between adjacent row vectors entering operation units in different rows of the same column; the controller is further configured to simultaneously control the column vectors of the corresponding sub-block B_rt to enter, in ascending order of the column number y, the y-th column corresponding to the X row * Y column operation units, with a time difference of 1 clock cycle between adjacent column vectors entering operation units in different columns of the same row. The resulting schedule is sketched below.
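A sketch of the staggered (systolic) schedule just described: row x of A_sr is injected at cycle x and column y of B_rt at cycle y, so unit (x, y) fires at cycle x + y. Cycle numbering from 0 is an assumption for illustration.

```python
# Sketch: arrival schedule of the systolic variant of FIG. 10.
# Row vectors enter one cycle apart; column vectors likewise; unit (x, y)
# therefore computes its L-element product at cycle x + y.
def systolic_schedule(X, Y):
    return {(x, y): x + y for x in range(X) for y in range(Y)}

sched = systolic_schedule(4, 4)
assert sched[(0, 0)] == 0 and sched[(3, 3)] == 6   # last unit fires at X+Y-2
```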
  • that is, to make full use of each operation unit 6030 (multiply-accumulator), the fractal matrix multiplication unit in this embodiment of the present invention may have a systolic array structure; it differs from the TPUv1 structure in that the amount of data transferred per systolic step is L elements (versus 1 in TPUv1), so the parallelism of the data operations is greater than in the TPUv1 systolic array.
  • BUFA and BUFB are the memories used to buffer the first matrix and the second matrix respectively; the first matrix buffer device (BUFA) in FIG. 10 divides a unit matrix of the A matrix
  • into X rows and, in each clock cycle, sequentially feeds the L elements of one row into an operation unit of the systolic array.
  • similarly, the second matrix buffer device (BUFB) divides a unit matrix of the second matrix into Y columns and, in each clock cycle, sequentially feeds the L elements of one column into the systolic array.
  • the specific timing is as follows:
  • the BUFC is a buffer device (which can be built from an L0 cache or buffer registers) that stores the "C" (offset) matrix in the "A*B+C" calculation; intermediate values of the matrix multiplication can also be stored in BUFC.
  • after the multiply-accumulators perform the multiplication, the accumulation tree accumulates the L multiplied intermediate values with one offset or intermediate value stored in the BUFC, as sketched below.
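A minimal sketch of the operation unit's datapath as described: L multipliers followed by an (L+1)-input addition tree whose extra input is the offset or intermediate value held in BUFC.

```python
# Sketch: one multiply-accumulate step of an operation unit with L multipliers
# and an (L+1)-input addition tree (the extra input comes from BUFC).
def mac_step(row_l, col_l, bufc_value):
    products = [a * b for a, b in zip(row_l, col_l)]   # L parallel multipliers
    return sum(products) + bufc_value                  # (L+1)-input adder tree

assert mac_step([1, 2, 3], [4, 5, 6], 10) == 42        # 4+10+18 plus offset 10
```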
  • taking M=2, N=2, K=2 (i.e., an 8x8 * 8x8 matrix multiplication) as an example, the controller 604 of the matrix multiplier 60 splits the matrix multiplication into the format of FIG. 11, a total of eight 4x4 unit matrix operations.
  • for the MNK matrix multiplication operation, there are many possible splitting orders;
  • the rules can follow manner 1 or manner 2 above, and it can be understood that the data-reuse-maximizing strategy of manner 2 can reduce the power consumption of data reads.
  • after the MNK fractal splitting, the control logic of the controller 604 feeds the eight fractals into the systolic array over eight clock cycles, as shown in FIG. 12 to FIG. 15.
  • optionally, the matrix multiplier 60 may further include an instruction dispatch unit 606, an instruction fetch unit 607, a data handling unit 608, a vector unit 609, a scalar unit 610, and a bus interface unit 611.
  • further, the matrix multiplier 60 provided by this embodiment of the present invention can be mounted as a coprocessor onto a central processing unit (CPU) 80, with the CPU assigning calculation tasks to the matrix multiplier 60; specifically, the CPU 80 may store the first matrix, the second matrix, and the related instructions in an external memory 70, and the matrix multiplier 60 completes the matrix multiplication by reading the first matrix, the second matrix, and the related instructions from the external memory 70.
  • the external memory 70 may specifically be a double data rate synchronous dynamic random access memory (DDR) or another readable and writable memory, and the external memory may be private to the matrix multiplier 60.
  • specifically, the first memory 601, the second memory 602, the third memory 605, and the external memory 70 are generally on-chip buffers (On-Chip Buffer), where:
  • 1. Vector Unit 609: contains various types of multi-parallelism computing devices (such as floating-point multiplication, floating-point addition, floating-point comparison, etc.) for executing SIMD (single instruction, multiple data) instructions, and is responsible for direct data movement between the unified buffer (Unified Buffer) and the L0C caches (i.e., the first memory, the second memory, and the third memory).
  • 2. Scalar Unit 610: contains various types of basic integer arithmetic devices (such as addition, multiplication, comparison, shifting, etc.).
  • 3. Data handling unit (Direct Memory Access Unit, DMA Unit): used to move data within and between the storage units, for example from L1 RAM to L0 RAM. When the data handling unit in this embodiment of the present invention moves matrix data participating in the multiplication from the external or internal memory of the matrix multiplier, the matrix needs to be stored according to the partitioning result; for example, for a 2*2 partition, the sub-block in the first row and first column of the first matrix
  • is stored in units of blocks, with A0, A1, A2, and A3 stored as one row, and so on.
  • in this way, when the first matrix or the second matrix is moved into the corresponding first or second memory, it can be stored in the above manner, and when an operation unit needs to read it, it can also be read in the above storage order; this makes the transposition flexible and fast when a row vector needs to be transposed into a column vector during calculation. A sketch of this layout follows.
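A sketch of the block-unit storage order described above: each sub-block is linearized into one contiguous row of memory (A0, A1, A2, A3, ...), which makes reading a sub-block, or transposing it on the fly, a simple index change. The flattening order shown is an illustrative assumption.

```python
# Sketch: store a matrix block-major -- each sub-block's elements contiguous --
# so a sub-block (or its transpose) can be streamed with simple indexing.
def to_block_major(mat, bx, by):
    rows, cols = len(mat), len(mat[0])
    blocks = []
    for i in range(0, rows, bx):
        for j in range(0, cols, by):     # one flattened row per sub-block
            blocks.append([mat[i + di][j + dj]
                           for di in range(bx) for dj in range(by)])
    return blocks

mat = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
assert to_block_major(mat, 2, 2)[0] == [0, 1, 4, 5]    # A0 stored as one row
```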
  • 4. Instruction Fetch Unit 607 (IFU): the instruction fetch unit, which internally integrates a PC (program counter) and an IM (instruction memory), fetches instructions from main memory through the bus interface unit (BIU) 611, and decodes and controls the execution flow.
  • 5. Instruction Dispatch Unit 606: parses the instructions passed from the instruction fetch unit and submits the type-specific instructions to four pipeline units, where the pipeline units are the scalar unit (Scalar Unit), the direct memory access (DMA) unit, the vector unit (Vector Unit), and the fractal matrix multiplication unit in FIG. 16.
  • the instruction dispatch unit has a mechanism to control order-preserving execution among the four pipelines.
  • it should be noted that the pipeline units are of two types: asynchronous execution (Posted Execution) and synchronous execution. All types of instructions are issued in order; the difference is that instructions on asynchronous execution units finish asynchronously, whereas instructions on synchronous execution units finish synchronously. The Scalar Unit executes synchronously; the Fractal Mat Mult Unit, the DMA unit, and the Vector Unit execute asynchronously.
  • in a possible implementation, for the above data handling unit, this embodiment of the present invention provides a configurable in-path (on-the-fly) matrix transposition function.
  • for example, when a partitioned sub-matrix of the first matrix is moved from one memory (e.g., the external memory of the matrix multiplier) to another (the internal first memory), the data handling unit performs the transposition during the move
  • and stores the matrix in the transposed order, because matrix transposition is a necessary operation step of the neural network training process.
  • compared with ordinary instructions that move first and transpose afterwards, the move instruction with configurable in-path transposition in this embodiment of the present invention is more flexible and makes the software easier and simpler, as shown in the table below.
  • comparing an ordinary move instruction with a move instruction having the configurable in-path matrix transposition function,
  • the same instruction with different parameters can support more application scenarios.
  • a configurable in-path matrix transposition method suitable for the fractal matrix multiplication processor architecture is thus designed.
  • this embodiment of the present invention further provides a storage structure adopting multi-level caches; all operation units can read and write interaction data through a unified buffer.
  • the matrix multiplier internally has two levels of dedicated caches, L1 and L0.
  • the L1 cache and the unified buffer exchange data with the external storage space through the DMA data handling unit; the external storage space consists of multi-level storage units.
  • the matrix multiplier contains multi-level caches: from L0 to L1 to the L2 cache, the capacity increases, the access bandwidth decreases, the latency increases, and the power overhead increases.
  • L0 is the innermost cache and can be used to cache the three matrices "first matrix", "second matrix", and "result matrix" of the MNK multiplication instruction. Being closest to the calculation, it has the highest bandwidth and latency requirements and the greatest opportunity for data reuse; it can be built with D flip-flops (DFF) to improve performance and reduce power consumption.
  • the source and destination operands of a fractal instruction come from L1 (the fifth memory 612 and the fourth memory 613 in FIG. 17); data are reused via L0 (i.e., the first memory 601 and the second memory 602 in FIG. 17) during execution, and software above the fractal instruction can reuse data via L1.
  • multi-level cache data reuse can be achieved by using the execution order of the fractal instructions and the software control sequence above the fractal instructions; at the same time, with the reuse of multi-level cache data, the data movement time between the caches can also be masked.
  • the following example illustrates the data reuse and movement between the multi-level caches:
  • at time 1, the controller reads the A0 and B0 parts of the matrices from the L1 buffer and stores them into L0.
  • at time 2, the A0 and B0 fractal matrices can already be read from L0 and participate in the operation.
  • at the same time, the hardware reads the B1 fractal from L1 and stores it into L0 to prepare for the next operation, and the data read time is masked by the calculation.
  • note that the hardware does not need to read two fractal matrices at the same time, but only reads the B1 matrix, because the matrix calculation "A0*B1" at time 3 reuses the data A0 stored at time 1. Referring to the above list, it can be seen that in the subsequent calculations there is data reuse in each time unit, as the schedule sketch below shows.
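The overlap described in this example can be written as a simple double-buffered schedule. In the sketch below, the time units and the block names A0/B0/B1 come from the example above; the B2 entry and everything else are assumptions added for illustration.

```python
# Sketch: double-buffered L1 -> L0 prefetch hiding load latency behind compute.
# Time 1 loads A0,B0; from time 2 onward each step computes one product while
# prefetching the single block needed next (A0 is reused, so only B moves).
schedule = [
    {"time": 1, "compute": None,    "load": ["A0", "B0"]},
    {"time": 2, "compute": "A0*B0", "load": ["B1"]},
    {"time": 3, "compute": "A0*B1", "load": ["B2"]},   # A0 reused from time 1
]
for step in schedule:
    print(step["time"], step["compute"], "while loading", step["load"])
```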
  • this embodiment of the present invention is not limited to data movement between L1 and L0; data movement from the L2 cache (e.g., external memories 701 and 702) to the L1 cache can also use data reusability to reduce bandwidth and optimize energy consumption.
  • this embodiment of the present invention does not limit the manner in which the matrix data are split or the order of movement; data movement should maximize data reuse so that in each time unit the fractal matrix calculation runs at full load.
  • through the multi-level cache structure, the data reuse of matrix fractals, the execution order of fractal instructions, and the software control sequence above fractal instructions, multi-level cache data reuse can be achieved, reducing the dependence on tightly coupled on-chip memory, optimizing energy efficiency, and reducing software programming complexity.
  • it should be noted that the instruction execution sequence for the matrix multiplication involves both synchronous execution and asynchronous execution:
  • a series of control preparations and data preparations, such as matrix size calculation, matrix data reading, and destination address calculation, are required before a fractal matrix multiplication instruction executes. If the processor's instruction execution policy is synchronous, i.e., all instructions must be committed in order, the instruction will most likely have to wait for unrelated instructions to finish before starting execution, which causes large and unnecessary performance losses.
  • with asynchronous execution, the hardware instruction dispatch unit 606 adopts multi-channel order-preserving issue, thereby ensuring that different types of instructions can execute concurrently while each channel stays in order:
  • control preparation and address calculation are performed in the scalar channel,
  • matrix reads and stores are performed in the data movement channel,
  • and the matrix multiplication calculation is performed in the matrix operation channel.
  • the channels can overlap with each other, and interdependent instructions can be synchronized by setting a wait flag (Wait Flag).
  • the effect is as shown in FIG. 18.
  • instructions are issued in order, and dependent instructions can be synchronized by wait instructions inserted by the software.
  • this asynchronous execution can mask the control preparation overhead of the fractal matrix multiplication.
  • an asynchronous execution method suitable for fractal matrix multiplication programming is thus designed; a toy model follows.
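A toy model of the multi-channel, order-preserving issue with wait flags: each channel retires its own instructions in order, and a wait entry blocks a channel until another channel has retired the flagged instruction. The three channels and the flag mechanism follow the description above; the instruction names are invented for illustration.

```python
# Sketch: three in-order channels (scalar / move / matrix) synced by flags.
from collections import deque

channels = {
    "scalar": deque([("addr_calc", None)]),
    "move":   deque([("load_A0B0", "addr_calc"), ("load_B1", None)]),
    "matrix": deque([("mul_A0B0", "load_A0B0")]),
}
done = set()
while any(channels.values()):
    for name, q in channels.items():
        if q:
            instr, wait_flag = q[0]
            if wait_flag is None or wait_flag in done:   # wait flag satisfied?
                q.popleft()
                done.add(instr)                          # retire in order
                print(f"{name}: {instr}")
```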
  • in summary, embodiments of the present invention provide a matrix multiplier that uses a controller to complete a blocked matrix multiplication method, namely MNK fractal: the control logic of the internal controller 604 in the matrix multiplier 60 splits a large matrix into unit matrix multiplications (i.e., X*L by L*Y matrices).
  • the control logic of the controller 604 sends a unit matrix multiplication task to the arithmetic circuit 603 in every clock cycle, so that data are pipelined and the X rows * Y columns of operation units operate at full capacity; the efficiency of matrix multiplication is increased, and the application effect of neural network algorithms is significantly improved.
  • the matrix multiplier provided by the embodiments of the present invention can perform the convolution operations and FC operations in a convolutional neural network.
  • all or part of the above embodiments may be implemented by software, hardware, firmware, or any combination thereof.
  • when implemented by a software program, they may be implemented wholly or partly in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.).
  • the computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media.
  • the usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

An embodiment of the present invention discloses a matrix multiplier, relating to the field of data computing technologies, and aims to perform blocked computation on two matrices. The matrix multiplier includes a first memory, a second memory, an arithmetic circuit, and a controller, where the arithmetic circuit can communicate data with the first memory and the second memory through a bus, and the controller is configured to control, according to a preset program or instructions, the partitioning of a first matrix and a second matrix into blocks, and to control the arithmetic circuit to multiply the corresponding blocks in the first memory and the second memory according to the controller's partitioning result. The matrix multiplier can be used to multiply two matrices.

Description

Matrix multiplier

Technical Field

The present invention relates to the field of computing technologies, and in particular, to a matrix multiplier.
Background

At present, to calculate the product of two matrices A and B, the calculation can be performed in either of the following two manners:

Manner 1: calculation by a vector processor.

Assuming C = A*B and that the number of elements the vector processor can compute simultaneously is M, referring to FIG. 1, the vector processor loads the i-th row vector of matrix A (including elements A_i1, A_i2, ..., A_i(M-1), A_iM) into a source register Reg0, then loads the j-th column vector of matrix B (including elements B_j1, B_j2, ..., B_j(M-1), B_jM) into a register Reg1, so that multiplication between the corresponding elements of Reg0 and Reg1 can be implemented; finally the accumulation is completed through an addition tree, the data C_ij of the i-th row and j-th column of matrix C is calculated, and matrix C can be obtained through multiple calculations.

Manner 2: to further increase the calculation speed, the matrix multiplication can be completed by a two-dimensional computing array.

For example, the two-dimensional computing array may be an N*N systolic array. In manner 1, completing the multiplication of two N*N matrices requires N^3 multiplication operations; since the vector processor can calculate multiplications between M elements per clock cycle, the time required to complete one such multiplication is N^3/M clock cycles. In manner 2, completing the multiplication of two N*N matrices also requires N^3 multiplication operations; since the systolic array has N^2 operation units, the time required to complete one matrix operation is N^3/N^2 = N clock cycles. Both manner 1 and manner 2 take a long time to complete the N*N matrix multiplication, and both have the problem of a relatively fixed, inflexible calculation size.
Summary

The technical problem to be solved by the embodiments of the present invention is to provide a matrix multiplier and related devices, solving the problems of inflexible calculation and low efficiency in matrix multiplication.

According to a first aspect, an embodiment of the present invention provides a matrix multiplier, which may include:

a first memory configured to store a first matrix, where the first matrix is an M*K matrix;

a second memory configured to store a second matrix, where the second matrix is a K*N matrix;

an arithmetic circuit connected to the first memory and the second memory, the arithmetic circuit including X rows * Y columns of operation units, each operation unit including a vector multiplication circuit and an addition circuit, where the vector multiplication circuit is configured to receive row vector data sent by the first memory and column vector data sent by the second memory and to multiply the two vectors, and the addition circuit is configured to add the results of multiplying the two vectors and to accumulate the calculation results belonging to the same operation unit to obtain the operation result of each operation unit;

a controller connected to the arithmetic circuit, the controller being configured to perform the following actions:

partitioning the first matrix in units of sub-blocks of size X*L to obtain S×R sub-blocks of the same size, where the sub-block in row s and column r of the S×R sub-blocks is denoted A_sr, s=(1, 2, 3, ... S), r=(1, 2, 3, ... R);

partitioning the second matrix in units of sub-blocks of size L*Y to obtain R×T sub-blocks of the same size, where the sub-block in row r and column t of the R×T sub-blocks is denoted B_rt, r=(1, 2, 3, ... R), t=(1, 2, 3, ... T);

the controller is further configured to perform the following action:

inputting the x-th of the X row vectors of any sub-block A_sr and the y-th of the Y column vectors of the corresponding sub-block B_rt into the operation unit in row x and column y of the X row * Y column operation units for calculation, x=(1, 2, 3, ... X), y=(1, 2, 3, ... Y), where the value of r in the sub-block A_sr equals the value of r in the corresponding sub-block B_rt.
An embodiment of the present invention provides a matrix multiplier that uses a controller to complete a blocked matrix multiplication method, namely MNK fractal: the control logic of the internal controller 604 in the matrix multiplier 60 splits a large matrix into unit matrix multiplications (i.e., X*L by L*Y matrices). The control logic of the controller 604 sends a unit matrix multiplication task to the arithmetic circuit 603 in every clock cycle, so that data are pipelined and the X rows * Y columns of operation units operate at full capacity. The efficiency of matrix multiplication is increased, and the application effect of neural network algorithms is significantly improved. The matrix multiplier provided by the embodiments of the present invention can perform the convolution operations and FC operations in a convolutional neural network.
In a possible implementation, the controller is specifically configured to perform the following action:

inputting, in parallel within the same clock cycle, the x-th of the X row vectors of any sub-block A_sr and the y-th of the Y column vectors of the corresponding sub-block B_rt into the operation unit in row x and column y of the X row * Y column operation units for calculation.

In a possible implementation, the controller is further configured to control the row vectors of the sub-block A_sr to enter, in ascending order of the row number x, the x-th row corresponding to the X row * Y column operation units, with a time difference of 1 clock cycle between adjacent row vectors entering operation units in different rows of the same column; the controller is further configured to simultaneously control the column vectors of the corresponding sub-block B_rt to enter, in ascending order of the column number y, the y-th column corresponding to the X row * Y column operation units, with a time difference of 1 clock cycle between adjacent column vectors entering operation units in different columns of the same row.

In a possible implementation, the controller is further configured to control:

within at least two consecutive sub-block multiplication calculation periods, the values of s and r remain unchanged and the value of t changes, so that the first memory reuses the same A_sr within the at least two consecutive sub-block multiplication calculation periods, where the sub-block multiplication calculation period is the time taken by the X row * Y column operation units to complete the matrix multiplication of one sub-block A_sr with the corresponding sub-block B_rt.

In a possible implementation, the matrix multiplier further includes a third memory connected to the arithmetic circuit;

the controller is configured to control the X row * Y column operation units to store the calculation results of the vector multiplication circuits and the addition circuits into the third memory.

In a possible implementation, the matrix multiplier further includes a fourth memory connected to the first memory and the second memory, and a fifth memory connected to the third memory;

the controller is further configured to control, before the multiplication of the first matrix and the second matrix is calculated:

moving the data sources of the first matrix and the second matrix from the fourth memory to the first memory and the second memory respectively, and moving the calculation results from the third memory to the fifth memory.

In a possible implementation, the vector multiplication circuit includes L multipliers, and the addition circuit includes an addition tree with (L+1) inputs.

In a possible implementation, the first memory, the second memory, the arithmetic circuit, and the controller are connected through a bus interface unit.
In a possible implementation, S = ⌈M/X⌉ and R = ⌈K/L⌉;

when M%X ≠ 0, rows (M+1) through S*X of the first matrix are not calculated and the results are assigned 0; when K%L ≠ 0, columns (K+1) through R*L of the first matrix are not calculated and the results are assigned 0.

In a possible implementation, R = ⌈K/L⌉ and T = ⌈N/Y⌉;

when K%L ≠ 0, rows (K+1) through R*L of the second matrix are not calculated and the results are assigned 0; when N%Y ≠ 0, columns (N+1) through T*Y of the second matrix are not calculated and the results are assigned 0.
In a possible implementation, the matrix multiplier further includes a data handling unit, where the data handling unit is configured to perform a matrix transposition operation on the first matrix before moving the first matrix to the first memory, or to perform a matrix transposition operation on the second matrix before moving the second matrix to the second memory.

In a possible implementation, the controller controls any sub-block of the first matrix to be stored in the first memory in row form, or controls any sub-block of the second matrix to be stored in the second memory in row form, so that it can be read out quickly and the sub-block can be transposed flexibly and quickly.
According to a second aspect, this application provides an electronic device, which may include:

the security element provided by any one of the implementations of the first aspect above and a discrete device coupled to the chip.

According to a third aspect, this application provides a system-on-chip, which includes the chip provided by any one of the implementations of the first aspect above. The system-on-chip may consist of the chip alone, or may include the chip and other discrete devices.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the background more clearly, the accompanying drawings required by the embodiments of the present invention or the background are described below.

FIG. 1 is a schematic diagram of a process of calculating the product of two matrices in the prior art;

FIG. 2 is a schematic diagram of converting a convolution kernel into a weight matrix in the prior art;

FIG. 3 is a schematic diagram of converting input data into an input matrix in the prior art;

FIG. 4 is a schematic diagram of a method for multiplying two matrices in the prior art;

FIG. 5 is a schematic diagram of a TPU systolic array in the prior art;

FIG. 6 is a structural diagram of a matrix multiplication accelerator according to an embodiment of the present invention;

FIG. 7 is a structural diagram of an operation unit 6030 according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of matrix partitioning according to an embodiment of the present invention;

FIG. 9 is a schematic wiring diagram of a specific arithmetic circuit 603 according to an embodiment of the present invention;

FIG. 10 is a schematic wiring diagram of another specific arithmetic circuit 603 according to an embodiment of the present invention;

FIG. 11 shows an input format of a matrix multiplier whose base is 4 according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of the pipelined execution of an M=2, N=2, K=2 matrix multiplier at time T=0;

FIG. 13 is a schematic diagram of the pipelined execution of an M=2, N=2, K=2 matrix multiplier at time T=1;

FIG. 14 is a schematic diagram of the pipelined execution of an M=2, N=2, K=2 matrix multiplier at time T=7;

FIG. 15 is a schematic diagram of the pipelined execution of an M=2, N=2, K=2 matrix multiplier at time T=11;

FIG. 16 is a schematic structural diagram of another matrix multiplier according to an embodiment of the present invention;

FIG. 17 is a schematic structural diagram of still another matrix multiplier according to an embodiment of the present invention;

FIG. 18 is a schematic diagram of an asynchronous instruction execution sequence according to an embodiment of the present invention.
Detailed Description

The embodiments of the present invention are described below with reference to the accompanying drawings in the embodiments of the present invention.

The terms "first", "second", "third", and "fourth" in the specification, claims, and accompanying drawings of this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes unlisted steps or units, or optionally further includes other steps or units inherent to the process, method, product, or device.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to independent or alternative embodiments mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

The terms "component", "module", and "system" used in this specification denote computer-related entities: hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device may be components. One or more components may reside within a process and/or thread of execution, and a component may be located on one computer and/or distributed between two or more computers. Furthermore, these components can execute from various computer-readable media having various data structures stored thereon. A component may, for example, communicate through local and/or remote processes based on signals having one or more data packets (e.g., data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet interacting with other systems by means of the signal).
Next, the technical problem to be solved and the application scenarios of this application are presented. In recent years, thanks to the excellent performance of convolutional neural networks in image classification, image recognition, audio recognition, and other related fields, they have become a research and development focus in academia and industry. Convolutional neural networks mainly involve convolution and fully connected (FC) operations, where the computation of the convolution operations can usually account for more than 70% of the computation of the whole network.

Convolution operations are not strictly equivalent to matrix multiplication operations, but a convolution operation can be converted into a matrix multiplication operation through reasonable data adjustment. In a convolutional neural network there are usually multiple convolution kernels; a convolution kernel is three-dimensional and contains data in three dimensions, with the x and y directions being the length and width of the data and the z direction considered the depth of the data. A convolution kernel is in fact a filter, mainly used to extract different features from an image. Referring to FIG. 2, a convolution kernel is essentially a combination of a series of weights. Assuming the number of convolution kernels is K, extracting the N elements at the same position in the z direction of the K kernels yields an N*K weight matrix; according to the specification of the matrix multiplier (i.e., the numbers of rows and columns of the matrices it can compute), the convolution kernels can be pre-stored in the memory of the matrix multiplier in the form of a weight matrix, to be called when the matrix multiplier performs matrix multiplication. In the embodiments of the present invention, "*" means "multiplied by".

Referring to FIG. 3, based on the stride of the convolution kernel (stride 1 in this embodiment of the present invention), the matrix multiplier can extract the N data in the z direction of M input points, M*N data in total, forming an input matrix; the matrix multiplier needs to multiply the input matrix by the weight matrix.

The FC operation is essentially a multiplication of a vector and a matrix. The input of the FC operation is a 9216-element vector and FC needs to output 4096 points; then, to obtain one output point of FC, a dot product of the 9216-element vector and 9216 weights is required, and to obtain all 4096 points, dot products of the 9216-element vector with 9216x4096 weights are required.

FIG. 4 shows the calculation formula of the matrix C = A*B, where A is a matrix of size M*K and B is a matrix of size K*N; in the embodiments of the present invention, M, N, and K are all positive integers. To obtain one data element of matrix C, the data of a row vector of matrix A must be dot-multiplied with the corresponding data of a column vector of matrix B and then accumulated, i.e., obtaining one element of matrix C requires K multiplications, so obtaining the whole matrix C requires M*N*K multiplications.

In the prior art, the systolic array calculation approach is used: for example, Google's custom machine-learning ASIC, the Google TPUv1, uses a systolic array design to implement a 256×256 2-D MAC array to optimize matrix multiplication and convolution operations (as shown in FIG. 5). Each cell in the figure is a multiplier; after a multiplier multiplies one element from each of the two matrices, the calculated result (the partial sum, i.e., an intermediate result of the matrix multiplication) is passed down to the accumulation unit below in the figure and accumulated with the previous related accumulated value. Thus, when the data is running at full load, the systolic array accumulates one matrix-size batch of intermediate values per clock cycle. In the above solution, matrix multiplication efficiency is low because the computation density is low; furthermore, during convolution, since the calculation size of the systolic array is relatively fixed, the inputs and weights require many transformations of form to realize the systolic array's computational efficiency, making the operation inflexible; moreover, in matrix multiplication a large data size is needed to achieve pipelined execution — for example, a 256x256 2-D systolic array is not efficient on small matrices.

In addition, a related patent implements an M*K*N 3-D MAC array, which, compared with the 2-D MAC array solutions of TPUv1 and NVDLA, further significantly improves the matrix multiplication calculation efficiency. That invention proposes a new hardware accelerator architecture able to complete an [NxN] matrix multiplication operation in a single clock cycle; the hardware architecture contains NxNxN processing elements (PEs) and NxN addition trees. A calculation method for splitting a large matrix into smaller matrices is also proposed. However, since the above solution needs to pad the matrix size to the size supported by the hardware, data bandwidth is wasted and the calculation efficiency is lowered; manually splitting the matrix into large plus small matrices makes software programming complicated and greatly increases the amount of software programming. Also, since that accelerator can only load matrix elements in a one-way cycle, software needs to split the matrix by itself, so the calculation mode is single and inflexible; moreover, once the memories of matrix A and matrix B cannot hold all the data, repeated reads occur, so the buffer size depends strongly on the business algorithm — i.e., the accelerator depends heavily on tightly coupled on-chip memory.
Therefore, the technical problem to be solved by this application is how to use hardware to compute the large volume of data operations in a convolutional neural network efficiently, flexibly, and with low energy consumption.

It can be understood that the matrix multiplier provided by the embodiments of the present invention can be applied not only to machine learning, deep learning, and convolutional neural networks, but also to digital image processing and digital signal processing, as well as to other fields involving matrix multiplication operations.
Based on the above analysis, this application provides a matrix multiplication accelerator and specifically analyzes and solves the technical problem raised in this application. Referring to FIG. 6, FIG. 6 is a structural diagram of a matrix multiplication accelerator according to an embodiment of the present invention. As shown in FIG. 6, the matrix multiplier 60 includes a first memory 601, a second memory 602, an arithmetic circuit 603, and a controller 604, where the arithmetic circuit 603 can communicate data with the first memory 601, the second memory 602, and the controller 604 through a bus. The arithmetic circuit 603 is configured to extract matrix data from the first memory 601 and the second memory 602 and perform vector multiplication and addition operations, and the controller 604 is configured to control the arithmetic circuit 603 to complete vector operations according to a preset program or instructions. Among them:
The first memory 601 is configured to store the first matrix, where the first matrix is an M*K matrix; if matrix a is the first matrix, the element in row i and column j of the first matrix a can be denoted a_ij, i=(1, 2, 3, ... M), j=(1, 2, 3, ... K).

The first memory 601 mentioned in this embodiment of the present invention, and the second memory 602, the third memory 605, and other internal memories of the matrix multiplier mentioned below, may each be a register, a random access memory (RAM), a static random access memory, a flash memory, or another readable and writable memory. In this application, the data types of the first matrix, the second matrix, and the operation results may all be of types such as int8, fp16, or fp32.

The second memory 602 is configured to store the second matrix, where the second matrix is a K*N matrix; if matrix b is the second matrix, the element in row j and column g of the second matrix b can be denoted b_jg, j=(1, 2, 3, ... K), g=(1, 2, 3, ... N).

M, K, and N, as well as X and Y, are all integers greater than 0; any two of M, N, and K may be equal or unequal, M, N, and K may all be equal or unequal, and X and Y may be equal or unequal; this application does not specifically limit this.

The arithmetic circuit 603 may include X rows * Y columns of operation units 6030 (which may be called multiply-accumulate units, MAC), each of which can independently perform vector multiplication. FIG. 6 is drawn with the arithmetic circuit 603 including 4*4 operation units 6030 as an example, i.e., X=4 and Y=4. An operation unit 6030 has two inputs, used respectively to receive a row vector and a column vector sent by the first memory 601 and the second memory 602, and performs a vector multiplication on the row vector and the column vector. Specifically, an operation unit 6030 includes a vector multiplication circuit and an addition circuit, where the vector multiplication circuit is configured to receive the row vector data sent by the first memory 601 and the column vector data sent by the second memory 602 and to multiply the two vectors, and the addition circuit is configured to add the results of multiplying the two vectors and to accumulate the calculation results belonging to the same operation unit to obtain the calculation result of each operation unit.

Referring to FIG. 7, a structural diagram of an operation unit 6030: in a possible implementation, the vector multiplication circuit includes L (e.g., L=4) multipliers, and the addition circuit includes an addition tree with (L+1) inputs, i.e., the addition tree accumulates the L multiplication results and the calculation results of this operation unit from different clock cycles. Optionally, the matrix multiplier 60 further includes a third memory 605 configured to store the operation results of the vector multiplication circuits and the addition circuits, possibly from different clock cycles. It can be understood that the third memory 605 in this application may include X*Y storage units, each configured to store every operation result of the corresponding operation unit; alternatively, each operation unit corresponds to a designated storage space in the third memory 605 for storing each operation result.
The controller 604 may calculate the product of the first matrix and the second matrix by performing the following actions:

The controller 604 partitions the first matrix in units of sub-blocks of size X*L to obtain S×R sub-blocks of the same size, where the sub-block in row s and column r of the S×R sub-blocks is denoted A_sr, s=(1, 2, 3, ... S), r=(1, 2, 3, ... R). That is, for the matrix multiplier 60 in this application, once it has been produced or shipped, its X*Y arrangement of operation units is fixed, and the number L of multipliers in the corresponding multiplication circuits is also fixed; therefore, before the matrix operation, the first matrix and the second matrix must be fractalized, i.e., partitioned into blocks. The division is performed by partitioning the first matrix in units of X*L sub-blocks. In this embodiment of the present invention, the purpose of the partitioning is to split a large matrix into many small matrices that fit the size of the matrix multiplier, then compute the small matrices in a certain order and accumulate the values of the related small matrices, finally obtaining the matrix multiplication result. This not only allows flexible calculation and facilitates subsequent reuse and multi-level caching, but also further improves calculation efficiency and reduces data movement bandwidth and energy consumption.

It should be noted that the first matrix is an M*K matrix, and the first matrix may not be exactly divisible into an integer number of X*L sub-blocks. Therefore, when M/X or K/L is not an integer, the operation can be performed by padding with 0 elements, or the corresponding positions do not participate in the calculation and the results are assigned 0. Specifically:

S = ⌈M/X⌉, R = ⌈K/L⌉;

when M%X ≠ 0, rows (M+1) through S*X of the first matrix are not calculated and the results are assigned 0; when K%L ≠ 0, columns (K+1) through R*L of the first matrix are not calculated and the results are assigned 0. That is, on the corresponding rows and columns, no substantive multiplication is performed by the operation units; they are treated as having been calculated with result 0, which saves the read and computation power of the corresponding operation units.

Correspondingly, the controller 604 further controls the second matrix to be partitioned in units of sub-blocks of size L*Y to obtain R×T sub-blocks of the same size, where the sub-block in row r and column t of the R×T sub-blocks is denoted B_rt, r=(1, 2, 3, ... R), t=(1, 2, 3, ... T). After the controller 604 controls the partitioning of the first matrix according to the specification of the arithmetic circuit 603, the second matrix must also be correspondingly partitioned to match the first matrix; otherwise the matrix multiplication cannot be performed.

It should be noted that the second matrix is a K*N matrix and may not be exactly divisible into an integer number of L*Y sub-blocks. Therefore, when K/L or N/Y is not an integer, the operation can likewise be performed by padding with 0 elements, or the corresponding positions do not participate in the calculation and the results are assigned 0. Specifically:

R = ⌈K/L⌉, T = ⌈N/Y⌉;

when K%L ≠ 0, rows (K+1) through R*L of the second matrix are not calculated and the results are assigned 0; when N%Y ≠ 0, columns (N+1) through T*Y of the second matrix are not calculated and the results are assigned 0. That is, on the corresponding rows and columns, no substantive multiplication is performed by the operation units; they are treated as having been calculated with result 0, which saves the read and computation power of the corresponding operation units.

After the first matrix and the second matrix are each partitioned into fixed-size blocks, they can be input into the arithmetic circuit 603 for sub-block by sub-block matrix multiplication. In the specific calculation process, the controller 604 can control the x-th of the X row vectors of any sub-block A_sr of the first matrix and the y-th of the Y column vectors of the corresponding sub-block B_rt to be input into the operation unit in row x and column y of the X row * Y column operation units for calculation, x=(1, 2, 3, ... X), y=(1, 2, 3, ... Y), where the value of r in the sub-block A_sr equals the value of r in the corresponding sub-block B_rt. Since the first and second matrices are partitioned (fractalized) before the row vectors of A_sr and the column vectors of B_rt are input into the operation units, there are multiple implementations of the specific order in which the sub-blocks A_sr and the corresponding B_rt are input into the arithmetic circuit 603.
In a possible implementation, the operations may be performed successively in order of the value of s in sub-block A_sr and the corresponding B_rt, or in order of the value of t. As shown in FIG. 8, for example, the first matrix is an M*K matrix and the second matrix is a K*N matrix, with M=12, K=6, N=12, X=4, Y=4, L=3; after partitioning the first and second matrices, S=3, R=2, T=3. The partitioned first matrix is therefore

[ A11 A12
  A21 A22
  A31 A32 ]

and the partitioned second matrix is

[ B11 B12 B13
  B21 B22 B23 ].

Each sub-block A_sr is an X*L, i.e., 4*3, matrix of elements of the first matrix, and each element of the partitioned B is actually an L*Y, i.e., 3*4, matrix of elements of the second matrix.
In the multiplication of the first matrix and the second matrix, a sub-block matrix multiplication needs to be performed for every sub-block A_sr of the first matrix with the corresponding sub-block B_rt of the second matrix. Deciding in which order the sub-block matrix multiplications are performed can include multiple implementations:

Manner 1: proceeding in the order of the matrix multiplication formula. For example, sub-block A11 and sub-block B11: in the first sub-block multiplication calculation period (which can be understood as the first round), all row vectors of A11 are input and operated on with all column vectors of the corresponding B11; in the second sub-block multiplication calculation period (which can be understood as the second round), all row vectors of A12 are operated on with all column vectors of the corresponding B21. In this way, through accumulation in the operation units, the value of the result point C11 in the first row and first column of the result matrix C is obtained; by analogy, the result points of all positions in the result matrix C can be obtained. In fact,

C11 = A11*B11 + A12*B21,

where A11*B11 and A12*B21 are each 4*4 matrices.

That is, C11 is actually a 4*4 matrix; therefore, according to matrix calculation rules, the finally obtained matrix C is the M*N result matrix, i.e., a 12*12 result matrix.
Manner 2: reusing one of the sub-blocks according to a certain rule. This embodiment of the present invention provides a sub-block reuse manner for invoking one sub-block A_sr of the first matrix and the corresponding sub-blocks B_rt of the second matrix for the sub-block matrix multiplications. Specifically, the controller 604 is further configured to control: within at least two consecutive sub-block multiplication calculation periods, the values of s and r remain unchanged and the value of t changes, so that the first memory reuses the same A_sr within the at least two consecutive sub-block multiplication calculation periods, where a sub-block multiplication calculation period is the time taken by the X row * Y column operation units to complete the matrix multiplication of one sub-block A_sr with the corresponding sub-block B_rt.

For example, in the above embodiment with M=12, K=6, N=12, X=4, Y=4, L=3: in the first sub-block multiplication calculation period (which can be understood as the first round), all row vectors of A11 are input and operated on with all column vectors of one corresponding sub-block B11; in the second period (the second round), s and r stay unchanged but t changes, and all row vectors of A11 are operated on with all column vectors of another corresponding sub-block B12; optionally, in the third period (the third round), all row vectors of A11 continue to be operated on with all column vectors of yet another corresponding sub-block B13. In this way, A11 in the first memory can be reused over several adjacent sub-block multiplication calculation periods, saving read/write overhead and reducing data movement bandwidth.
In manner 1 and manner 2 above, the calculation rule for a sub-block A_sr of the first matrix and the corresponding sub-block B_rt of the second matrix within one sub-block multiplication calculation period is: the x-th of the X row vectors of any sub-block A_sr and the y-th of the Y column vectors of the corresponding sub-block B_rt are input into the operation unit in row x and column y of the X row * Y column operation units for calculation, x=(1, 2, 3, ... X), y=(1, 2, 3, ... Y), where the value of r in the sub-block A_sr equals the value of r in the corresponding sub-block B_rt. That is, for A_sr and the corresponding sub-block B_rt of the second matrix, any one row vector and any one column vector are input into a designated operation unit of the X row * Y column operation units for calculation. For example, the second row vector [a21 a22 a23] of A11 and the third column vector of the corresponding sub-block B11 of the second matrix, [b13 b23 b33]^T, are operated on in the operation unit in the second row and third column of the X row * Y column operation units, and so on.
基于图6所示的运算电路603中的运算单元的排布方式,参见图9,图9为本发明实施例提供的一种具体的运算电路603中的布线的示意图。
其中BUFA为第一矩阵的第一存储器601,BUFB为第二矩阵的第二存储器602,BUFC为存储各个运算单元6030的计算结果的第三存储器605,运算电路603中包含X行*Y列(假设X=4,Y=4)个运算单元即为图中的MAC GRP R00C00到MAC GRP R03C03,且每一个运算单元MAC GRP可以进X*L矩阵的一个行向量和L*Y矩阵中的一列行向量的乘法运算。
运算电路603,在本发明实施例中可以称之为分形矩阵乘法单元,由3-D MAC阵列(MAC Cube)和累加器(Accumulator)组成,用于执行分形的矩阵乘法指令,如下:C=A*B,或者C=A*B+C,其中A/B/C为二维矩阵。其中A的尺寸为(M*base)x(K*base),B的尺寸为(K*base)x(N*base),C的尺寸为(M*base)x(N*base)。Base为此运算电路603的基础尺寸也即是X*Y,例如8*8、16*16、32*32等。上述C=A*B,或者C=A*B+C的计算操作称为MNK矩阵乘法(和累加)。在实际执行过程中,MNK的矩阵乘法会以分形方式,控制器会控制将大矩阵分解为base尺寸的基本矩阵乘法,并按照特定顺序组合(上述提到的方式一或方式二)完成。
The specific architecture of the fractal matrix multiplication unit is as shown in FIG. 7 above (assume Base=4). For example, the MAC Group in FIG. 7 is an N*N (4*4) multiply-accumulator group composed of N (4) multiplication units and one (N+1)-input (5-input) accumulation tree. At the level of matrix multiplication, such a multiply-accumulator performs the computation of one row times one column with accumulation (that is, one element of the result matrix). FIG. 9 contains 4x4 multiply-accumulator groups in total, which can simultaneously compute a complete 4x4 * 4x4 matrix multiplication.
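Functionally, one such multiply-accumulator group reduces to a few lines; the sketch below (illustrative, with made-up operand values) shows L parallel multipliers feeding an (L+1)-input adder tree together with the previous partial sum:

```python
def mac_group(row_vec, col_vec, partial_sum=0.0):
    """One operation unit: L multipliers in parallel, then an
    (L+1)-input adder tree that also folds in the running partial sum."""
    assert len(row_vec) == len(col_vec)                   # both of length L
    products = [a * b for a, b in zip(row_vec, col_vec)]  # L multipliers
    return sum(products) + partial_sum                    # adder tree

# Row 2 of A_11 against column 3 of B_11 (hypothetical values):
c_23 = mac_group([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])        # -> 32.0
```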
It can be understood that the wiring diagram in FIG. 9 allows the operation circuit 603 to complete the matrix multiplication of one sub-block A_sr and the corresponding sub-block B_rt within a single clock cycle: through the wiring in FIG. 9, all X row vectors of A_sr and all Y column vectors of the corresponding sub-block B_rt can reach the corresponding operation units 6030 simultaneously from BUFA and BUFB. Therefore, the controller 604 can control the operation circuit 603 to complete the multiplication of one sub-block A_sr and the corresponding sub-block B_rt in one clock cycle, and in the next clock cycle complete the matrix multiplication of another A_sr and its corresponding B_rt, or of the same A_sr and another corresponding B_rt.
See FIG. 10, a schematic diagram of the wiring in another specific operation circuit 603 according to an embodiment of the present invention. The operation circuit 603 corresponding to FIG. 10 is a systolic array structure. Specifically, the controller 604 is configured to control the row vectors of any A_sr to enter, in ascending order of row number x, row x of the X rows * Y columns of operation units, with a time difference of one clock cycle between adjacent row vectors entering operation units in the same column but different rows; the controller is further configured to simultaneously control the column vectors of the corresponding sub-block B_rt to enter, in ascending order of column number y, column y of the X rows * Y columns of operation units, with a time difference of one clock cycle between adjacent column vectors entering operation units in the same row but different columns.
That is, to fully utilize every operation unit 6030 (multiply-accumulator), the fractal matrix multiplication unit in this embodiment of the present invention may take a systolic array structure. It differs from the TPUv1 structure in that the amount of data transferred per systolic beat is L elements (versus one element in TPUv1), so the parallelism of the data computation exceeds that of the systolic array in TPUv1.
Based on the systolic array architecture, in the wiring structure corresponding to FIG. 10, BUFA and BUFB are the memories used to buffer the first matrix and the second matrix respectively. The first-matrix buffer (BUFA) in FIG. 10 splits a unit matrix of matrix A into X rows and, in each clock cycle, sends the L elements of one row, in order, into one operation unit of the systolic array. Likewise, the second-matrix buffer (BUFB) splits a unit matrix of the second matrix into Y columns and, in each clock cycle, sends the L elements of one column, in order, into the systolic array. The specific timing is as follows:
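The original figures give the cycle-exact timing; as a simplified model only (assuming, as is usual for systolic arrays, that data entering row x is skewed by x cycles and data entering column y by y cycles), the beat at which each operation unit first sees a valid pair of L-element operands can be tabulated as:

```python
X, Y = 4, 4  # the PE grid of the running example

# Under the skewed feed described above, PE (x, y) receives its first
# valid operand pair at beat x + y (0-indexed).
for x in range(X):
    beats = [x + y for y in range(Y)]
    print(f"PE row {x}: first valid beat per column = {beats}")
```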
BUFC is the buffer (which may be built from L0 cache or buffer registers) that holds the "C" (offset) matrix in the computation "A*B + C"; the intermediate values of the matrix multiplication may also be stored in BUFC. After the multiply-accumulators complete a multiplication, the accumulation tree accumulates the L freshly multiplied intermediate values with one offset or intermediate value stored in BUFC.
Taking M=2, N=2, K=2 (that is, an 8x8 * 8x8 matrix multiplication) as an example, the controller 604 in the matrix multiplier 60 splits the matrix multiplication into the format of FIG. 11: eight 4x4 unit matrix operations in total. For the MNK matrix multiplication, many splitting orders are possible; the rule may follow method one or method two above, and it can be understood that the maximum-data-reuse strategy of method two reduces the power consumed by data reads. After the MNK fractal splitting, the control logic of the controller 604 passes these eight fractals into the systolic array over eight clock cycles, as shown in FIG. 12 through FIG. 15. FIG. 12 shows the pipelined execution of the M=2, N=2, K=2 fractal matrix multiplier at T=0, FIG. 13 at T=1, FIG. 14 at T=7, and FIG. 15 at T=11. It can be seen that at T=6, the seventh clock cycle, the systolic array begins to run at full load. During the last six clock cycles, the unit matrix fractals flow out of the systolic array, and the multiplication of the entire matrix is complete.
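Merely to make the splitting concrete (the exact order shown in FIG. 11 is not reproduced here), the eight unit-matrix operations can be enumerated in a method-two-style order, in which each A block stays resident across consecutive fractals:

```python
M, N, K = 2, 2, 2   # the 8x8 * 8x8 example, counted in 4x4 blocks

# One fractal per clock cycle; A[m][k] is reused while n varies.
fractals = [(m, k, n) for m in range(M) for k in range(K) for n in range(N)]
for cycle, (m, k, n) in enumerate(fractals):
    print(f"cycle {cycle}: C[{m}][{n}] += A[{m}][{k}] * B[{k}][{n}]")
# -> 8 unit 4x4 multiplications issued over 8 cycles
```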
Optionally, referring to FIG. 16, the matrix multiplier 60 may further include an instruction dispatch unit 606, an instruction fetch unit 607, a data moving unit 608, a vector unit 609, a scalar unit 610, and a bus interface unit 611. Further, the matrix multiplier 60 provided in this embodiment of the present invention may be mounted as a coprocessor onto a central processing unit (CPU) 80, and the CPU assigns computation tasks to the matrix multiplier 60. Specifically, the CPU 80 may store the first matrix, the second matrix, and the related instructions in an external memory 70, and the matrix multiplier 60 completes the matrix multiplication by reading the first matrix, the second matrix, and the related instructions from the external memory 70. The external memory 70 may specifically be a double data rate synchronous dynamic random access memory (DDR) or another readable and writable memory, and may be a memory private to the matrix multiplier 60. Specifically, the first memory 601, the second memory 602, the third memory 605, and the external memory 70 are generally on-chip buffers (On-Chip Buffer). In particular:
1. Vector unit 609 (Vector Unit): contains various types of multi-parallelism computation devices (such as floating-point multiplication, floating-point addition, floating-point comparison, and so on) and is used to execute SIMD (Single Instruction Multiple Data) instructions. It is also responsible for direct data movement between the unified buffer (Unified Buffer) and the L0C caches (that is, the first, second, and third memories).
2. Scalar unit 610 (Scalar Unit): contains various types of basic integer computation devices (such as addition, multiplication, comparison, shifting, and so on).
3. Data moving unit (Direct Memory Access Unit, DMA Unit): used to move data between the storage units, for example from L1 RAM to L0 RAM. When the data moving unit in this embodiment of the present invention moves matrix data participating in the multiplication from the external memory or internal memory of the matrix multiplier, it must store the matrix according to the result of the blocking. For example, for 2*2 blocking, the sub-block in row 1, column 1 of the first matrix
[ A0  A1 ]
[ A2  A3 ]
is stored in units of blocks: A0, A1, A2, A3 are stored as one row, and so on for the other sub-blocks. In this way, when the first matrix is moved into the corresponding first memory, or the second matrix into the corresponding second memory, the data can be stored in the above manner; when an operation unit needs to read, it can also read in that same storage order, so that during computation, whenever a row vector must be transposed into a column vector, the transposition can be performed flexibly and quickly.
4. Instruction fetch unit 607 (Instruction Fetch Unit, IFU): internally integrates the PC (program counter) and the IM (instruction memory), fetches instructions from main memory through the bus interface unit (BIU) 611, and decodes and controls the execution flow.
5. Instruction dispatch unit 606 (Dispatch Unit): parses the instructions passed on by the instruction fetch unit and submits each instruction, by type, to one of four pipeline units, namely the scalar unit (Scalar Unit), the data moving unit (Direct Memory Access Unit, DMA Unit), the vector unit (Vector Unit), and the fractal matrix multiplication unit shown in FIG. 16. The instruction dispatch unit has a mechanism to keep execution in order across the four pipelines.
It should be noted that the above pipeline units come in two types: asynchronous execution (Posted Execution) and synchronous execution. Instructions of all types are issued in order; the difference is that an asynchronous execution unit finishes its instructions asynchronously, while a synchronous execution unit finishes them synchronously. The scalar unit (Scalar Unit) executes synchronously; the fractal matrix multiplication unit (Fractal Mat Mult Unit), the DMA unit, and the vector unit (Vector Unit) execute asynchronously.
In one possible implementation, for the above data moving unit, this embodiment of the present invention provides a configurable in-transit matrix transpose function. For example, when one of the blocked sub-matrices of the first matrix is moved from one memory (for example, the external memory of the matrix multiplier) to another (the first memory inside the matrix multiplier), the data moving unit performs the matrix transposition during the move and stores the data in the order of the transposed matrix. Matrix transposition is a necessary step in the neural network training process. Compared with an ordinary instruction sequence that moves first and transposes afterwards, a move instruction with configurable in-transit matrix transpose is more flexible and keeps the software simpler and more concise, as in the comparison below.
Ordinary instructions: a move instruction followed by a separate transpose instruction. Instruction with the configurable in-transit matrix transpose function: a single move instruction carrying a transpose parameter. (The original presents this comparison as a table, which is not reproduced here.)
Comparing the ordinary move instruction with the move instruction carrying the configurable in-transit matrix transpose function shows that supporting the configurable in-transit transpose lets the same instruction, given different parameters, cover more application scenarios. A configurable in-transit matrix transpose method suited to the fractal matrix multiplication processor architecture is thus designed.
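A rough functional sketch of such a move (the function name and interface are invented here for illustration; the patent only specifies the behavior):

```python
import numpy as np

def dma_move(src, dst, transpose=False):
    """Hypothetical DMA move: copy one sub-block, optionally storing
    it in transposed order while it is in transit."""
    dst[...] = src.T if transpose else src

# One parameterized move replaces a move instruction plus a
# separate transpose instruction:
src = np.arange(12).reshape(4, 3)               # an X*L sub-block
dst = np.empty((3, 4), dtype=src.dtype)         # room for the transpose
dma_move(src, dst, transpose=True)
```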
Referring to FIG. 17, to facilitate data reuse, reduce power consumption, and reduce dependence on the tightly coupled on-chip memory, this embodiment of the present invention further provides a storage structure adopting multi-level caches. All operation units can read and write interaction data through the unified buffer (Unified Buffer). The matrix multiplier internally has two levels of dedicated cache, L1 and L0. The L1 cache and the unified buffer exchange data with the external storage space through the data moving DMA unit, and the external storage space consists of multiple levels of storage units. For example, the matrix multiplier contains multiple levels of cache: from L0, to L1, to the L2 cache, capacity increases while access bandwidth decreases, latency increases, and power overhead increases. L0 is the innermost cache and may be used to buffer the three matrices of the MNK multiplication instruction: the "first matrix", the "second matrix", and the "result matrix". Being closest to the computation, it has the most demanding bandwidth and latency requirements and the greatest opportunity for data reuse, and it may be built from D flip-flops (DFF) to improve performance and reduce power consumption. The source and destination operands of the fractal instructions come from L1 (the fifth memory 612 and the fourth memory 613 in FIG. 17); during execution, data is reused through L0 (that is, the first memory 601 and the second memory 602 in FIG. 17); and software above the fractal instructions can reuse data through L1. By exploiting the execution order of the fractal instructions and the software-controlled order above them, multi-level cache data reuse can be achieved. At the same time, with multi-level data reuse, the time spent moving data between the caches can also be hidden. The following example illustrates the data reuse and movement between the multi-level caches.
Assume the following two matrices:
A = [ A0  A1 ]        B = [ B0  B1 ]
    [ A2  A3 ]            [ B2  B3 ]
The data movement steps are then as shown in the following table:
Time | Read from L1 | In L0            | Compute
-----|--------------|------------------|--------
1    | A0, B0       |                  |
2    | B1           | A0, B0           | A0*B0
3    | A2           | A0, B0, B1       | A0*B1
4    | A1           | A0, A2, B0, B1   | A2*B0
5    | B2           | A1, A2, B0, B1   | A2*B1
6    | B3           | A1, A2, B1, B2   | A1*B2
7    | A3           | A1, A2, B2, B3   | A1*B3
8    |              | A2, A3, B2, B3   | A3*B2
9    |              | A2, A3, B2, B3   | A3*B3
At time 1, the controller 604 reads the A0 and B0 parts of the matrices from the L1 cache and stores them into L0.
At time 2, the A0 and B0 fractal matrices can already be read out of L0 and take part in computation; meanwhile the hardware reads the B1 fractal from L1 and stores it into L0, preparing for the next computation, so the data read time is hidden behind the computation. At this point the hardware need not read two fractal matrices at once, but only the B1 matrix, because the matrix computation "A0*B1" at time 3 reuses the data A0 stored at time 1. Referring to the table above, every subsequent time unit likewise involves data reuse.
It should be noted that this embodiment of the present invention is not limited to data movement between L1 and L0; data movement from L2 (for example, external memories 701 and 702) to the L1 cache can likewise exploit data reuse to reduce bandwidth and optimize energy consumption. This embodiment does not restrict the manner of splitting the matrix data or the order of moving it; the data movement should maximize data reuse so that, in every time unit, the fractal matrix computation runs at full load.
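The schedule in the table can be checked mechanically; the toy model below (illustrative only, ignoring L0 capacity and eviction) verifies that every computation finds both operands already resident in L0 while the next fetch is hidden behind it:

```python
# (fetch from L1, compute from L0) per time unit, from the table above.
schedule = [
    (["A0", "B0"], None),
    (["B1"], ("A0", "B0")),
    (["A2"], ("A0", "B1")),   # A0 reused from time 1
    (["A1"], ("A2", "B0")),
    (["B2"], ("A2", "B1")),
    (["B3"], ("A1", "B2")),
    (["A3"], ("A1", "B3")),
    ([],     ("A3", "B2")),
    ([],     ("A3", "B3")),
]
l0 = set()
for time, (fetch, compute) in enumerate(schedule, start=1):
    l0 |= set(fetch)                        # prefetch overlaps the compute
    if compute:
        assert set(compute) <= l0, "operand missing from L0"
        print(f"time {time}: compute {compute[0]}*{compute[1]}")
```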
Through the multi-level cache structure of this embodiment of the present invention, exploiting the data reuse of the matrix fractals, the execution order of the fractal instructions, and the software-controlled order above the fractal instructions, multi-level cache data reuse is achieved; this reduces dependence on the tightly coupled on-chip memory, optimizes energy efficiency, and lowers software programming complexity.
The execution order of the matrix multiplication instructions in this embodiment of the present invention includes two modes: synchronous execution and asynchronous execution.
In this embodiment of the present invention, a fractal matrix multiplication instruction requires a series of control preparations and data preparations before it executes, such as computing matrix sizes, reading matrix data, and computing target addresses. If the processor's instruction execution policy is synchronous execution, that is, all instructions must commit in order, an instruction will very likely wait for unrelated instructions to finish before it starts executing. This causes a large and unnecessary performance loss, as in the following synchronous instruction execution order:
Address computation → control preparation → matrix 0 read → matrix 0 multiplication → address computation → control preparation → matrix 1 read → matrix 1 multiplication.
In the above execution order, the control preparation, address computation, and data read for the second matrix (matrix 1) do not depend on the completion of the matrix 0 multiplication, and the extra serialization causes unnecessary waiting. To solve this problem, the hardware instruction dispatch unit 606 in this embodiment of the present invention adopts multi-channel in-order issue, which guarantees that instructions of different types can execute simultaneously, each channel in order. In the example above, control preparation and address computation execute in order within the scalar channel, matrix reads and stores execute in order within the data moving channel, and matrix multiplications execute in order within the matrix operation channel. The channels may overlap one another without cross-channel ordering, and mutually dependent instructions can be synchronized by setting a wait flag (Wait Flag). With this asynchronous execution strategy, instructions can run in parallel, which improves execution efficiency. If the synchronous execution example above adopts the asynchronous strategy, the effect is as shown in FIG. 18: in the asynchronous instruction execution order, instructions are not ordered across channels, and instructions with dependency relationships can be synchronized through wait instructions added by software. This asynchronous execution hides the control preparation overhead of the fractal matrix multiplication. An asynchronous execution mode suited to the fractal matrix multiplication programming style is thus designed.
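A toy model of this multi-channel in-order issue (the channel and instruction names are invented for illustration; only the mechanism of per-channel ordering plus wait flags follows the text):

```python
from collections import deque

# Three channels issue independently; each channel is in order, and a
# dependent instruction stalls its channel until its wait flag is set.
channels = {
    "scalar": deque(["addr_calc_0", "ctrl_prep_0", "addr_calc_1", "ctrl_prep_1"]),
    "dma":    deque(["read_matrix_0", "read_matrix_1"]),
    "matmul": deque(["mul_matrix_0", "mul_matrix_1"]),
}
wait_flag = {"read_matrix_0": "ctrl_prep_0", "read_matrix_1": "ctrl_prep_1",
             "mul_matrix_0": "read_matrix_0", "mul_matrix_1": "read_matrix_1"}
done, cycle = set(), 0
while any(channels.values()):
    issued = [q.popleft() for q in channels.values()
              if q and wait_flag.get(q[0]) in done | {None}]
    done |= set(issued)
    print(f"cycle {cycle}: {issued or 'stall'}")
    cycle += 1
```

Running the model shows, for instance, the address computation for matrix 1 issuing in the same cycle as the matrix 0 multiplication, which is exactly the overlap that hides the control preparation overhead.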
In summary, a matrix multiplier is provided that uses a controller to carry out a matrix multiplication blocking method, namely MNK fractalization: the control logic of the internal controller 604 of the matrix multiplier 60 splits large matrices into unit matrix multiplications (that is, X*L x L*Y matrix products). The control logic of the controller 604 sends a unit matrix multiplication task to the operation circuit 603 in every clock cycle, so that the data is pipelined and the X rows * Y columns of operation units operate at full load. This raises the efficiency of matrix multiplication and markedly improves the application effect of neural network algorithms. The matrix multiplier provided in this embodiment of the present invention can perform the convolution operations and FC (fully connected) operations of convolutional neural networks.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented by a software program, they may be implemented fully or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are fully or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
Although this application is described herein with reference to the embodiments, in the process of implementing this application as claimed, a person skilled in the art can understand and implement other variations of the disclosed embodiments by viewing the accompanying drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude the plural. A single processor or another unit may fulfill several functions enumerated in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not mean that these measures cannot be combined to produce a good effect.
Although this application is described with reference to specific features and the embodiments thereof, it is evident that various modifications and combinations may be made without departing from the spirit and scope of this application. Correspondingly, the specification and accompanying drawings are merely illustrative descriptions of this application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Evidently, a person skilled in the art can make various modifications and variations to this application without departing from its spirit and scope. This application is intended to cover these modifications and variations provided that they fall within the scope of the claims of this application and their equivalent technologies.

Claims (10)

  1. A matrix multiplier, comprising:
    a first memory, configured to store a first matrix, wherein the first matrix is an M*K matrix;
    a second memory, configured to store a second matrix, wherein the second matrix is a K*N matrix;
    an operation circuit connected to the first memory and the second memory, wherein the operation circuit comprises X rows * Y columns of operation units, and each operation unit comprises a vector multiplication circuit and an addition circuit; the vector multiplication circuit is configured to receive row vector data sent by the first memory and column vector data sent by the second memory and to multiply the two vectors; and the addition circuit is configured to add the results of multiplying the two vectors and to accumulate the computation results belonging to a same operation unit to obtain an operation result of each operation unit; and
    a controller connected to the operation circuit, wherein the controller is configured to perform the following actions:
    partitioning the first matrix in units of sub-blocks of size X*L to obtain S×R sub-blocks of a same size, wherein a sub-block in row s, column r of the S×R sub-blocks is denoted as A_sr, s = (1, 2, 3, ... S), r = (1, 2, 3, ... R); and
    partitioning the second matrix in units of sub-blocks of size L*Y to obtain R×T sub-blocks of a same size, wherein a sub-block in row r, column t of the R×T sub-blocks is denoted as B_rt, r = (1, 2, 3, ... R), t = (1, 2, 3, ... T);
    wherein the controller is further configured to perform the following action:
    inputting row x of the X row vectors of any sub-block A_sr and column y of the Y column vectors of the corresponding sub-block B_rt into the operation unit in row x, column y of the X rows * Y columns of operation units for computation, x = (1, 2, 3, ... X), y = (1, 2, 3, ... Y), wherein r in the any sub-block A_sr and r in the corresponding sub-block B_rt take a same value.
  2. The matrix multiplier according to claim 1, wherein the controller is specifically configured to perform the following action:
    inputting row x of the X row vectors of any sub-block A_sr and column y of the Y column vectors of the corresponding sub-block B_rt, in parallel within a same clock cycle, into the operation unit in row x, column y of the X rows * Y columns of operation units for computation.
  3. The matrix multiplier according to claim 1 or 2, wherein
    the controller is further configured to control the row vectors of the any A_sr to enter, in ascending order of row number x, the corresponding row x of the X rows * Y columns of operation units, with a time difference of one clock cycle between adjacent row vectors entering operation units in a same column but different rows; and the controller is further configured to simultaneously control the column vectors of the corresponding sub-block B_rt to enter, in ascending order of column number y, the corresponding column y of the X rows * Y columns of operation units, with a time difference of one clock cycle between adjacent column vectors entering operation units in a same row but different columns.
  4. The matrix multiplier according to any one of claims 1 to 3, wherein the controller is further configured to control:
    keeping the values of s and r unchanged and changing the value of t within at least two consecutive sub-block multiplication computation cycles, so that the first memory reuses a same A_sr within the at least two consecutive sub-block multiplication computation cycles, wherein the sub-block multiplication computation cycle is the time used by the X rows * Y columns of operation units to complete the matrix multiplication of one sub-block A_sr and the corresponding sub-block B_rt.
  5. The matrix multiplier according to any one of claims 1 to 4, wherein the matrix multiplier further comprises a third memory connected to the operation circuit; and
    the controller is configured to control the X rows * Y columns of operation units to store the operation results of the vector multiplication circuits and the addition circuits into the third memory.
  6. The matrix multiplier according to claim 5, wherein the matrix multiplier further comprises a fourth memory connected to the first memory and the second memory, and a fifth memory connected to the third memory; and
    the controller is further configured to control, before the multiplication of the first matrix and the second matrix is computed:
    moving the data sources of the first matrix and the second matrix from the fourth memory to the first memory and the second memory respectively, and moving the computation results from the third memory to the fifth memory.
  7. The matrix multiplier according to any one of claims 1 to 6, wherein the vector multiplication circuit comprises L multipliers, and the addition circuit comprises an adder tree with (L+1) inputs.
  8. The matrix multiplier according to any one of claims 1 to 7, wherein
    the first memory, the second memory, the operation circuit, and the controller are connected through a bus interface unit.
  9. The matrix multiplier according to any one of claims 1 to 8, wherein
    S = ⌈M/X⌉ and R = ⌈K/L⌉; and
    when M%X ≠ 0, rows (M+1) through S*X of the first matrix are not computed and the results are assigned 0; and when K%L ≠ 0, columns (K+1) through R*L of the first matrix are not computed and the results are assigned 0.
  10. The matrix multiplier according to any one of claims 1 to 8, wherein
    R = ⌈K/L⌉ and T = ⌈N/Y⌉; and
    when K%L ≠ 0, rows (K+1) through R*L of the second matrix are not computed and the results are assigned 0; and when N%Y ≠ 0, columns (N+1) through T*Y of the second matrix are not computed and the results are assigned 0.
PCT/CN2018/111077 2017-12-29 2018-10-19 矩阵乘法器 WO2019128404A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
KR1020207021471A KR102443546B1 (ko) 2017-12-29 2018-10-19 행렬 곱셈기
KR1020227031367A KR102492477B1 (ko) 2017-12-29 2018-10-19 행렬 곱셈기
EP18895760.9A EP3726399A4 (en) 2017-12-29 2018-10-19 MATRICAL MULTIPLIER
JP2020536531A JP6977239B2 (ja) 2017-12-29 2018-10-19 行列乗算器
US16/915,915 US11334648B2 (en) 2017-12-29 2020-06-29 Matrix multiplier
US17/725,492 US11934481B2 (en) 2017-12-29 2022-04-20 Matrix multiplier

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711499179.X 2017-12-29
CN201711499179.XA CN109992743B (zh) 2017-12-29 2017-12-29 矩阵乘法器

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/915,915 Continuation US11334648B2 (en) 2017-12-29 2020-06-29 Matrix multiplier

Publications (1)

Publication Number Publication Date
WO2019128404A1 true WO2019128404A1 (zh) 2019-07-04

Family

ID=67065034

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/111077 WO2019128404A1 (zh) 2017-12-29 2018-10-19 矩阵乘法器

Country Status (6)

Country Link
US (2) US11334648B2 (zh)
EP (1) EP3726399A4 (zh)
JP (1) JP6977239B2 (zh)
KR (2) KR102443546B1 (zh)
CN (2) CN111859273A (zh)
WO (1) WO2019128404A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3846036A1 (en) * 2019-12-31 2021-07-07 Beijing Baidu Netcom Science And Technology Co. Ltd. Matrix storage method, matrix access method, apparatus and electronic device
CN113704689A (zh) * 2021-08-25 2021-11-26 北京大学 一种基于昇腾ai处理器的矩阵乘算子的处理方法及装置

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060195B (zh) 2018-01-19 2021-05-04 华为技术有限公司 一种数据处理的方法及装置
CN108983236B (zh) * 2018-07-27 2022-05-20 南京航空航天大学 一种sar回波数据预滤波技术的fpga实现方法
CN110543934B (zh) * 2019-08-14 2022-02-01 北京航空航天大学 一种用于卷积神经网络的脉动阵列计算结构及方法
CN112446007A (zh) * 2019-08-29 2021-03-05 上海华为技术有限公司 一种矩阵运算方法、运算装置以及处理器
CN112579042B (zh) * 2019-09-29 2024-04-19 广州希姆半导体科技有限公司 计算装置及方法、芯片、电子设备及计算机可读存储介质
US11194549B2 (en) * 2019-10-25 2021-12-07 Arm Limited Matrix multiplication system, apparatus and method
CN114600126A (zh) * 2019-10-30 2022-06-07 华为技术有限公司 一种卷积运算电路和卷积运算方法
CN111079081B (zh) * 2019-12-16 2021-02-12 海光信息技术股份有限公司 一种矩阵乘法器、数据处理方法、集成电路器件及处理器
CN111291323B (zh) * 2020-02-17 2023-12-12 南京大学 一种基于脉动阵列的矩阵乘法处理器及其数据处理方法
CN113536221B (zh) * 2020-04-21 2023-12-15 中科寒武纪科技股份有限公司 运算方法、处理器以及相关产品
CN113536219B (zh) * 2020-04-21 2024-01-26 中科寒武纪科技股份有限公司 运算方法、处理器以及相关产品
CN111581595B (zh) * 2020-04-24 2024-02-13 科大讯飞股份有限公司 一种矩阵乘法计算方法及计算电路
KR102393629B1 (ko) * 2020-04-29 2022-05-02 연세대학교 산학협력단 다중 디콘볼루션 레이어를 갖는 인공 신경망을 이용하는 이미지 업스케일링 장치 및 이의 디콘볼루션 레이어 다중화 방법
CN113918879A (zh) * 2020-07-08 2022-01-11 华为技术有限公司 矩阵运算的方法和加速器
US20220051086A1 (en) * 2020-08-17 2022-02-17 Alibaba Group Holding Limited Vector accelerator for artificial intelligence and machine learning
US20220164663A1 (en) * 2020-11-24 2022-05-26 Arm Limited Activation Compression Method for Deep Learning Acceleration
CN112632464B (zh) * 2020-12-28 2022-11-29 上海壁仞智能科技有限公司 用于处理数据的处理装置
CN112612447B (zh) * 2020-12-31 2023-12-08 安徽芯纪元科技有限公司 一种矩阵计算器及基于该矩阵计算器的全连接层计算方法
KR20220101518A (ko) * 2021-01-11 2022-07-19 에스케이하이닉스 주식회사 곱셈-누산 회로 및 이를 포함하는 프로세싱-인-메모리 장치
CN112991142B (zh) * 2021-03-31 2023-06-16 腾讯科技(深圳)有限公司 图像数据的矩阵运算方法、装置、设备及存储介质
CN113296733B (zh) * 2021-04-25 2024-09-03 阿里巴巴创新公司 数据处理方法以及装置
CN116710912A (zh) * 2021-04-26 2023-09-05 华为技术有限公司 一种矩阵乘法器及矩阵乘法器的控制方法
CN113032723B (zh) * 2021-05-25 2021-08-10 广东省新一代通信与网络创新研究院 一种矩阵乘法器的实现方法及矩阵乘法器装置
US20220414053A1 (en) * 2021-06-24 2022-12-29 Intel Corporation Systolic array of arbitrary physical and logical depth
CN113268708B (zh) * 2021-07-16 2021-10-15 北京壁仞科技开发有限公司 用于矩阵计算的方法及装置
CN113918120A (zh) * 2021-10-19 2022-01-11 Oppo广东移动通信有限公司 计算装置、神经网络处理设备、芯片及处理数据的方法
CN114630108A (zh) * 2022-03-01 2022-06-14 维沃移动通信有限公司 感光数据校正电路、方法、装置、电子设备及介质
TWI814618B (zh) * 2022-10-20 2023-09-01 創鑫智慧股份有限公司 矩陣運算裝置及其操作方法
US20240168762A1 (en) * 2022-11-21 2024-05-23 Nvidia Corporation Application programming interface to wait on matrix multiply-accumulate
CN115756384B (zh) * 2022-11-22 2024-05-17 海光信息技术股份有限公司 张量计算单元及使用方法、数据处理装置及操作方法
CN116192359B (zh) * 2022-12-27 2024-01-05 北京瑞莱智慧科技有限公司 一种同态乘法阵列电路和数据处理方法
CN115827261B (zh) * 2023-01-10 2023-05-19 北京燧原智能科技有限公司 基于分布式网络的数据同步方法、装置、服务器及介质
CN116795432B (zh) * 2023-08-18 2023-12-05 腾讯科技(深圳)有限公司 运算指令的执行方法、装置、电路、处理器及设备

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101086699A (zh) * 2007-07-12 2007-12-12 浙江大学 基于单fpga的矩阵乘法器装置
CN103902509A (zh) * 2014-04-23 2014-07-02 荣成市鼎通电子信息科技有限公司 Wpan中全并行输入的循环左移准循环矩阵乘法器
CN107315574A (zh) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 一种用于执行矩阵乘运算的装置和方法

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02266458A (ja) * 1989-04-06 1990-10-31 Nec Corp ニューラルネットワークシミュレーション装置
DE4036455C1 (zh) * 1990-11-15 1992-04-02 Siemens Ag, 8000 Muenchen, De
JPH0644196A (ja) * 1992-07-24 1994-02-18 Toshiba Corp 並列計算機用マイクロプロセッサ
JPH06175986A (ja) * 1992-12-10 1994-06-24 Nippon Telegr & Teleph Corp <Ntt> 行列演算の並列処理方法
CN100449522C (zh) * 2007-07-12 2009-01-07 浙江大学 基于多fpga的矩阵乘法并行计算系统
US8250130B2 (en) * 2008-05-30 2012-08-21 International Business Machines Corporation Reducing bandwidth requirements for matrix multiplication
CN104346318B (zh) * 2014-10-15 2017-03-15 中国人民解放军国防科学技术大学 面向通用多核dsp的矩阵乘加速方法
CN105589677A (zh) * 2014-11-17 2016-05-18 沈阳高精数控智能技术股份有限公司 一种基于fpga的脉动结构矩阵乘法器及其实现方法
CN104636316B (zh) * 2015-02-06 2018-01-12 中国人民解放军国防科学技术大学 面向gpdsp的大规模矩阵乘法计算的方法
CN104915322B (zh) * 2015-06-09 2018-05-01 中国人民解放军国防科学技术大学 一种卷积神经网络硬件加速方法
CN104899182B (zh) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 一种支持可变分块的矩阵乘加速方法
US10853448B1 (en) * 2016-09-12 2020-12-01 Habana Labs Ltd. Hiding latency of multiplier-accumulator using partial results
CN106445471B (zh) * 2016-10-13 2018-06-01 北京百度网讯科技有限公司 处理器和用于在处理器上执行矩阵乘运算的方法
JP6912703B2 (ja) * 2017-02-24 2021-08-04 富士通株式会社 演算方法、演算装置、演算プログラム及び演算システム
JP6907700B2 (ja) * 2017-05-23 2021-07-21 富士通株式会社 情報処理装置、マルチスレッド行列演算方法、およびマルチスレッド行列演算プログラム
CN109213962B (zh) 2017-07-07 2020-10-09 华为技术有限公司 运算加速器
CN111860815A (zh) * 2017-08-31 2020-10-30 中科寒武纪科技股份有限公司 一种卷积运算方法及装置
US10713214B1 (en) * 2017-09-27 2020-07-14 Habana Labs Ltd. Hardware accelerator for outer-product matrix multiplication
US12061990B2 (en) * 2017-10-17 2024-08-13 Xilinx, Inc. Static block scheduling in massively parallel software defined hardware systems
US10346163B2 (en) * 2017-11-01 2019-07-09 Apple Inc. Matrix computation engine
CN112840356B (zh) * 2018-10-09 2023-04-11 华为技术有限公司 运算加速器、处理方法及相关设备
US11989257B2 (en) * 2020-10-29 2024-05-21 Hewlett Packard Enterprise Development Lp Assigning processing threads for matrix-matrix multiplication

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101086699A (zh) * 2007-07-12 2007-12-12 浙江大学 基于单fpga的矩阵乘法器装置
CN103902509A (zh) * 2014-04-23 2014-07-02 荣成市鼎通电子信息科技有限公司 Wpan中全并行输入的循环左移准循环矩阵乘法器
CN107315574A (zh) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 一种用于执行矩阵乘运算的装置和方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3726399A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3846036A1 (en) * 2019-12-31 2021-07-07 Beijing Baidu Netcom Science And Technology Co. Ltd. Matrix storage method, matrix access method, apparatus and electronic device
US11635904B2 (en) 2019-12-31 2023-04-25 Kunlunxin Technology (Beijing) Company Limited Matrix storage method, matrix access method, apparatus and electronic device
CN113704689A (zh) * 2021-08-25 2021-11-26 北京大学 一种基于昇腾ai处理器的矩阵乘算子的处理方法及装置
CN113704689B (zh) * 2021-08-25 2022-11-11 北京大学 一种基于昇腾ai处理器的矩阵乘算子的处理方法及装置

Also Published As

Publication number Publication date
EP3726399A1 (en) 2020-10-21
KR20200098684A (ko) 2020-08-20
CN109992743B (zh) 2020-06-16
US11934481B2 (en) 2024-03-19
KR20220129107A (ko) 2022-09-22
CN111859273A (zh) 2020-10-30
US20220245218A1 (en) 2022-08-04
EP3726399A4 (en) 2021-02-17
JP2021508125A (ja) 2021-02-25
KR102443546B1 (ko) 2022-09-15
CN109992743A (zh) 2019-07-09
KR102492477B1 (ko) 2023-01-30
JP6977239B2 (ja) 2021-12-08
US11334648B2 (en) 2022-05-17
US20200334322A1 (en) 2020-10-22

Similar Documents

Publication Publication Date Title
WO2019128404A1 (zh) 矩阵乘法器
CN111291859B (zh) 通用矩阵-矩阵乘法数据流加速器半导体电路
CN108805266B (zh) 一种可重构cnn高并发卷积加速器
US20200257754A1 (en) Performing matrix multiplication in hardware
WO2019007095A1 (zh) 运算加速器
US20180341495A1 (en) Hardware Accelerator for Convolutional Neural Networks and Method of Operation Thereof
WO2017185389A1 (zh) 一种用于执行矩阵乘运算的装置和方法
WO2019041251A1 (zh) 芯片装置及相关产品
WO2019205617A1 (zh) 一种矩阵乘法的计算方法及装置
TW202414199A (zh) 用於訓練神經網路之方法、系統及非暫時性電腦可讀儲存媒體
US20240119114A1 (en) Matrix Multiplier and Matrix Multiplier Control Method
CN110059809B (zh) 一种计算装置及相关产品
WO2022205197A1 (zh) 一种矩阵乘法器、矩阵计算方法及相关设备
CN111079908B (zh) 片上网络数据处理方法、存储介质、计算机设备和装置
CN111178505B (zh) 卷积神经网络的加速方法和计算机可读存储介质
CN111291884B (zh) 神经网络剪枝方法、装置、电子设备及计算机可读介质
JP7136343B2 (ja) データ処理システム、方法、およびプログラム
CN115081600A (zh) 执行Winograd卷积的变换单元、集成电路装置及板卡
CN111222632B (zh) 计算装置、计算方法及相关产品
CN116795542A (zh) 一种基于GPU的后量子密码Kyber并行加速方法
Kadota et al. A new reconfigurable architecture with smart data-transfer subsystems for the intelligent image processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18895760

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020536531

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018895760

Country of ref document: EP

Effective date: 20200714

ENP Entry into the national phase

Ref document number: 20207021471

Country of ref document: KR

Kind code of ref document: A