WO2022007597A1 - Method and accelerator for matrix operations - Google Patents

Method and accelerator for matrix operations

Info

Publication number
WO2022007597A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
subset
storage space
memory
accelerator
Application number
PCT/CN2021/099891
Other languages
English (en)
French (fr)
Inventor
Li Tao
Lu Tingyu
Cui Baolong
Yu Licheng
Liu Haocheng
Lin Weibin
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to EP21837272.0A (EP4180996A4)
Publication of WO2022007597A1
Priority to US18/093,929 (US20230161835A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14: Handling requests for interconnection or transfer
    • G06F 13/20: Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/278: Data partitioning, e.g. horizontal or vertical partitioning
    • G06F 2213/00: Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 2213/28: DMA

Definitions

  • the present application relates to the field of computers, and in particular, to a method and an accelerator for matrix operations.
  • the process of a matrix operation is usually as follows: first, the processor loads the data on which the matrix operation is to be performed from the main memory into registers; then, the processor performs the matrix operation on the data in the registers to obtain the result of the matrix operation. It can be seen that the matrix operation depends on the computing capability of the processor and the register resources in the processor. With the explosive growth of information, the scale of the matrices involved in matrix operations continues to increase. Because the computing capability of the processor and the register resources in the processor are limited, efficient matrix operations cannot be performed on large-scale matrices. Therefore, how to provide an efficient matrix operation method has become an urgent technical problem to be solved.
  • the present application provides a matrix operation method and accelerator, so that matrix operations are not limited by the computing capability of the processor or the register resources in the processor and can be completed efficiently.
  • the present application provides an accelerator for matrix operations.
  • the accelerator includes at least a control (control, CTRL) unit, a memory, and a processing unit (process element, PE).
  • the CTRL unit is used to receive a matrix operation instruction;
  • the memory is used to divide its storage area into multiple storage spaces, for example, into a first storage space, a second storage space, and a third storage space; the memory then stores a subset of the first matrix in the first storage space, stores a subset of the second matrix in the second storage space, and stores a third matrix in the third storage space.
  • the first matrix and the second matrix are the matrices indicated by the matrix operation instruction as participating in the operation.
  • the third matrix is a matrix composed of the subsets obtained by multiplying subsets of the first matrix by subsets of the second matrix; the PE performs, based on the matrix operation instruction, a matrix operation on the subset of the first matrix in the first storage space and the subset of the second matrix in the second storage space to obtain a result of the matrix operation.
  • the matrix operation accelerator includes at least one PE.
  • when the matrix operation accelerator includes multiple PEs, the multiple PEs may be used to perform, based on the matrix operation instruction, parallel matrix operations on the subsets of the first matrix in the first storage space and the subsets of the second matrix in the second storage space to obtain the results of the matrix operations. In this way, multiple PEs perform matrix operations in parallel, so the speed of the matrix operation no longer depends on the calculation speed of a single PE. Even for large-scale matrices, the accelerator for matrix operations can quickly complete the operations, greatly improving the efficiency of matrix operations.
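  • As a purely illustrative sketch (not the patent's implementation), this parallelism can be mimicked in Python with NumPy, assuming 4×4 blocking and four simulated PEs; the names pe_task and BLOCK are invented for the sketch:

```python
# Illustrative only: four simulated PEs, each multiplying its assigned subset
# of the first matrix A against every subset of the second matrix B in parallel.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

BLOCK = 4
A = np.arange(16 * 4, dtype=float).reshape(16, 4)   # first matrix, 16x4
B = np.arange(4 * 16, dtype=float).reshape(4, 16)   # second matrix, 4x16

A_subsets = [A[i:i + BLOCK, :] for i in range(0, 16, BLOCK)]   # A0..A3
B_subsets = [B[:, j:j + BLOCK] for j in range(0, 16, BLOCK)]   # B0..B3

def pe_task(a_blk):
    # One PE: multiplies its A subset by every B subset (one row of C blocks).
    return [a_blk @ b_blk for b_blk in B_subsets]

with ThreadPoolExecutor(max_workers=4) as pool:     # four simulated PEs
    c_block_rows = list(pool.map(pe_task, A_subsets))

C = np.block(c_block_rows)                          # assemble the third matrix
assert np.allclose(C, A @ B)
```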
  • the PE in the matrix operation accelerator may also update a subset of the third matrix in the third storage space based on the result of the matrix operation, where the subset of the third matrix is obtained by performing a matrix operation on a subset of the first matrix and a subset of the second matrix. For example, assume that the subset of the third matrix currently in the third storage space is the subset C0 obtained by multiplying the subset A0 of the first matrix A by the subset B0 of the second matrix B, and that after the subset A0 is multiplied by the subset B1 of the second matrix B, the result C1 of the matrix operation is obtained. Then the PE updates the subset of the third matrix in the third storage space based on the result C1 of the matrix operation.
  • for example, C1 can be accumulated onto the stored subset, so that the subset of the third matrix in the third storage space after the update is C0+C1.
  • alternatively, updating the subset of the third matrix in the third storage space may specifically be: replacing the subset C0 of the third matrix currently in the third storage space with a result C2, so that the subset of the third matrix in the third storage space after the update is C2.
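  • A toy sketch of the two update styles just described (accumulation versus replacement), with NumPy arrays standing in for the storage blocks; the reading that C2 is the internally accumulated C0 + C1 is our assumption:

```python
import numpy as np

A0, B0, B1 = np.ones((4, 4)), np.eye(4), 2 * np.eye(4)

C_store = A0 @ B0        # third storage space currently holds C0 = A0 x B0

# Accumulation style: the new result C1 is added onto the stored subset.
C1 = A0 @ B1
C_store = C_store + C1   # stored subset becomes C0 + C1

# Replacement style: the PE accumulates internally (e.g. in a register) and
# overwrites the stored subset with the combined result C2 (assumed C0 + C1).
C2 = (A0 @ B0) + (A0 @ B1)
C_store = C2             # stored subset becomes C2
assert np.allclose(C_store, A0 @ (B0 + B1))
```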
  • in addition, each PE can determine the subsets of the matrix operation that it is responsible for based on the indication of the CTRL unit, and store the result of the matrix operation it obtains (which can be an intermediate result or a subset finally composing the third matrix) in the third storage space.
  • before the matrix operation is performed, the CTRL unit in the matrix operation accelerator may also block the first matrix and the second matrix based on the matrix operation instruction to obtain multiple subsets of the first matrix and multiple subsets of the second matrix.
  • a subset may be composed of several elements in at least one consecutive row or consecutive column of the matrix. Each subset obtained by dividing a matrix must include consecutive elements of the matrix, any element of the matrix can be included in only one subset, and all elements of the matrix need to be included in some subset.
  • when the CTRL unit divides the matrices, the obtained subsets may have the same or different scales, but it is necessary to ensure that the subsets of the first matrix and the subsets of the second matrix after blocking can be multiplied; that is, the number of columns included in a subset of the first matrix is the same as the number of rows included in the corresponding subset of the second matrix.
  • for example, the matrix may be divided into square matrices of a preset size in a left-to-right and top-to-bottom manner; that is, the obtained subsets of the matrix are all square matrices with the same number of rows and columns.
  • blocking the matrices to be operated on in the CTRL unit enables the matrix operation accelerator to perform block operations on the subsets of the divided matrices and, for matrix operation accelerators with multiple PEs, provides the data basis for parallel matrix operations across the PEs, making fast and efficient matrix operations possible.
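  • As an illustration of such blocking, a minimal Python helper (hypothetical, not the patent's code) that splits a matrix into square subsets of a preset size, left to right and top to bottom:

```python
import numpy as np

def block_matrix(M, size):
    # Left-to-right, top-to-bottom blocking into size x size square subsets;
    # every element of M lands in exactly one subset.
    rows, cols = M.shape
    assert rows % size == 0 and cols % size == 0  # otherwise pad with zeros
    return [M[r:r + size, c:c + size]
            for r in range(0, rows, size)
            for c in range(0, cols, size)]

A = np.arange(16 * 4, dtype=float).reshape(16, 4)
subsets = block_matrix(A, 4)            # four 4x4 subsets A0..A3
print(len(subsets), subsets[0].shape)   # 4 (4, 4)
```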
  • the matrix operation accelerator may further include a direct memory access (direct memory access, DMA) unit, and the DMA unit is used to perform data access operations when the matrix operation accelerator performs matrix operations.
  • the DMA unit may obtain N first subsets of the first matrix and N second subsets of the second matrix from the shared storage space according to the blocking result of the CTRL unit, and store the N first subsets and the N second subsets in the first storage space and the second storage space of the memory, respectively, where N is greater than or equal to the number of PEs included in the matrix operation accelerator, and N is a positive integer.
  • the shared storage space is the storage space shared by the processor and the accelerator for the matrix operation, and the shared storage space may be, for example, the main memory.
  • the value of N is usually related to the size of the memory in the matrix operation accelerator. If the memory space is large enough, N can be the number of subsets included in the first matrix or the number of subsets included in the second matrix; if the memory space is limited, N can be a multiple of the number of PEs included in the matrix operation accelerator.
  • since the matrix operation accelerator has an independent memory and a DMA unit that can flexibly access data in the shared storage space, the number of data accesses between the matrix operation accelerator and the shared storage space is reduced, the time for accessing data is saved, and the efficiency of matrix operations is improved.
  • the DMA unit may also acquire, from the shared storage space, second subsets of the second matrix that have not yet participated in the matrix operation, and store the acquired second subsets in the second storage space of the memory. In this way, it can be ensured that the data of the matrix operation is loaded from the shared storage space into the corresponding storage spaces of the memory in an orderly manner, making it possible to perform the block matrix operations in an orderly and effective manner and realizing efficient matrix operations.
  • the DMA unit may also fetch the third matrix currently stored in the third storage space from the memory and store it in the shared storage space, where the third matrix is the matrix obtained by performing the matrix operation on the first matrix and the second matrix.
  • the final result of the matrix operation can be output from the accelerator of the matrix operation to the shared storage space, thereby facilitating the processor to directly read the final result of the matrix operation from the shared storage space.
  • the CTRL unit may also send an interrupt instruction to the processor, where the interrupt instruction is used to inform the processor that the matrix operation on the first matrix and the second matrix has been completed, so that the processor can obtain the final result of the matrix operation from the shared storage space, providing a reliable data basis for subsequent calculation, analysis, and so on.
  • the PE in the matrix operation accelerator may include, for example, a multiplier and an adder; the first input terminal and the second input terminal of the multiplier are connected to the first storage space and the second storage space of the memory, respectively, the output terminal of the multiplier is connected to the first input terminal of the adder, the second input terminal of the adder is connected to the third storage space of the memory, and the output terminal of the adder is connected to the third storage space of the memory.
  • the multiplier can multiply elements of the subset of the first matrix by elements of the subset of the second matrix; the adder can add the calculation results of the multipliers to the corresponding elements of the current subset of the third matrix in the third storage space, and the elements of the subset of the third matrix in the third storage space are updated with the result of the addition operation.
  • in this way, the multiplication of the subset of the first matrix by the subset of the second matrix is realized, so that the matrix operation accelerator can complete the matrix operation accurately and efficiently.
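  • A minimal numerical sketch of one such multiply-add path, under the assumption of four multipliers feeding an adder that reads and updates one element of the third storage space (the function name is invented):

```python
import numpy as np

def pe_multiply_add(a_row, b_col, c_current):
    # The four "multipliers": elementwise products of one row of the A subset
    # and one column of the B subset.
    products = a_row * b_col
    # The "adder": sums the products with the current element of the
    # third-matrix subset, producing the value written back to storage.
    return c_current + products.sum()

a_row = np.array([1.0, 2.0, 3.0, 4.0])
b_col = np.array([5.0, 6.0, 7.0, 8.0])
print(pe_multiply_add(a_row, b_col, 0.0))   # 5 + 12 + 21 + 32 = 70.0
```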
  • alternatively, the PE in the matrix operation accelerator may include, for example, a multiplier, an adder, and a register; the first input terminal and the second input terminal of the multiplier are connected to the first storage space and the second storage space of the memory, respectively, the output of the multiplier and the output of the register are both connected to the inputs of the adder, the output of the adder is connected to the input of the register, and the output of the adder is also connected to the third storage space of the memory.
  • the register can store the elements of the current subset of the third matrix in the third storage space; the multiplier can multiply elements of the subset of the first matrix by elements of the subset of the second matrix; the adder can add the calculation results of the multipliers to the elements of the subset of the third matrix currently held in the register, and the elements of the subset of the third matrix in the third storage space are updated with the result of the addition operation.
  • the registers in this implementation only play the role of data cache in the PE to reduce the number of times the PE accesses data from the memory during the matrix operation, thereby improving the processing efficiency of the matrix operation.
  • the number of multipliers included in the PE is related to the size of the subset of the first matrix and the size of the subset of the second matrix.
  • for example, if the scale of the subsets of the first matrix and the scale of the subsets of the second matrix are both 4 × 4, then 4 multipliers can be set in the PE; as another example, if the scale of the subsets of the first matrix and the scale of the subsets of the second matrix are both 8 × 8, then 8 multipliers can be set in the PE.
  • the present application also provides a method for matrix operation.
  • the method is applied to an accelerator for matrix operation, and the accelerator for matrix operation is used to perform matrix operation.
  • the method may include: in response to a received matrix operation instruction, storing a subset of the first matrix and a subset of the second matrix in the first storage space and the second storage space of the memory, respectively, and storing the subset obtained by multiplying the subset of the first matrix by the subset of the second matrix in the third storage space of the memory, where the matrix operation instruction is used to instruct that a matrix operation be performed on the first matrix and the second matrix, and the third storage space is used to store a third matrix composed of the subsets obtained by multiplying the subsets of the first matrix by the subsets of the second matrix; and then performing matrix operations on the subsets of the first matrix and the subsets of the second matrix according to the matrix operation instruction to obtain the results of the matrix operations.
  • performing the matrix operations on the subsets of the first matrix and the subsets of the second matrix according to the matrix operation instruction may include, for example, performing parallel matrix operations on the subsets of the first matrix and the subsets of the second matrix according to the matrix operation instruction.
  • the method provided by the present application may further include: updating a subset of the third matrix in the third storage space based on the result of the matrix operation, where the subset of the third matrix is obtained by performing a matrix operation on a subset of the first matrix and a subset of the second matrix.
  • the method provided in this embodiment of the present application may further include: dividing the first matrix and the second matrix into blocks based on the matrix operation instruction to obtain multiple first subsets of the first matrix and multiple second subsets of the second matrix.
  • the method provided by this embodiment of the present application may further include: obtaining N first subsets of the first matrix and N second subsets of the second matrix from a shared storage space according to the result of the blocking, where N is greater than or equal to the number of processing units PE included in the matrix operation accelerator, N is a positive integer, and the shared storage space is the storage space shared by the processor and the matrix operation accelerator. Storing the subsets of the first matrix and the subsets of the second matrix in the first storage space and the second storage space of the memory, respectively, may then include: storing the N first subsets in the first storage space of the memory and storing the N second subsets in the second storage space of the memory.
  • the method provided by the embodiment of the present application may further include: acquiring, from the shared storage space, first subsets of the first matrix that have not yet participated in the matrix operation, and storing the acquired first subsets in the first storage space of the memory.
  • the method provided by this embodiment of the present application may further include: acquiring, from the shared storage space, second subsets of the second matrix that have not yet participated in the matrix operation, and storing the acquired second subsets in the second storage space of the memory.
  • the method provided by this embodiment of the present application may further include: retrieving the third matrix from the memory and storing it in the shared storage space, where the third matrix is the matrix obtained by performing the matrix operation on the first matrix and the second matrix.
  • the method provided by the embodiment of the present application may further include: sending an interrupt instruction to the processor, where the interrupt instruction is used to notify completion of the matrix operation on the first matrix and the second matrix.
  • the matrix operation accelerator for implementing the method may include a processing unit PE, where the PE includes a multiplier and an adder; the first input terminal and the second input terminal of the multiplier are connected to the first storage space and the second storage space of the memory, respectively, the output of the multiplier is connected to the first input of the adder, the second input of the adder is connected to the third storage space of the memory, and the output of the adder is connected to the third storage space of the memory.
  • the process of performing the matrix operation in the PE may include: the multiplier multiplies elements of the subset of the first matrix by elements of the subset of the second matrix; the adder adds the calculation results of the multipliers to the elements of the current subset of the third matrix in the third storage space; and the elements of the subset of the third matrix in the third storage space are updated with the result of the addition operation.
  • alternatively, the matrix operation accelerator for implementing the method may include a processing unit PE, where the PE includes a multiplier, an adder, and a register; the first input terminal and the second input terminal of the multiplier are connected to the first storage space and the second storage space of the memory, respectively, the output of the multiplier and the output of the register are connected to the inputs of the adder, the output of the adder is connected to the input of the register, and the output of the adder is also connected to the third storage space of the memory.
  • in this case, the process of performing the matrix operation in the PE may include: the register stores the elements of the current subset of the third matrix in the third storage space; the multiplier multiplies elements of the subset of the first matrix by elements of the subset of the second matrix; the adder adds the calculation results of the multipliers to the corresponding elements of the current subset of the third matrix in the register; and the elements of the subset of the third matrix in the third storage space are updated with the result of the addition operation.
  • the number of multipliers included in the PE is related to the size of the subset of the first matrix and the size of the subset of the second matrix.
  • the method provided in the second aspect is implemented by the accelerator for matrix operations provided in the first aspect.
  • the present application further provides an apparatus for matrix operations, the apparatus including modules for performing the method for matrix operations in the second aspect or any possible implementation manner of the second aspect.
  • the present application also provides a matrix operation device, which includes a processor and a memory; the memory is used to store computer instructions, and the processor is used to execute, according to the computer instructions, the operation steps of the method for matrix operations in the second aspect or any possible implementation manner of the second aspect.
  • the present application also provides a device, which includes a processor, a shared storage space, and the accelerator for matrix operations provided in the first aspect or any possible implementation manner of the first aspect, where the processor and the matrix operation accelerator share the shared storage space; the processor is used to send a matrix operation instruction to the matrix operation accelerator, and the matrix operation accelerator is used to perform a matrix operation on the matrices in the shared storage space based on the matrix operation instruction, according to the method provided in the second aspect or any possible implementation manner of the second aspect.
  • the present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer performs the operation steps of the methods of the above aspects.
  • the present application provides a computer program product comprising instructions that, when run on a computer, cause the computer to perform the operation steps of the methods of the above aspects.
  • on the basis of the implementation manners provided by the above aspects, the present application may further combine them to provide more implementation manners.
  • FIG. 1 is a schematic diagram of the logical architecture of a system 10 suitable for matrix operations in the present application;
  • FIG. 2 is a schematic diagram of the logical architecture of the computing modules involved when the PE 131 performs one multiply-add process, as provided by the present application;
  • FIG. 3 is a flow chart of a method for matrix operations provided by the present application;
  • FIG. 4 is a schematic diagram of each PE provided by the present application performing a block multiplication operation;
  • FIG. 5 is a schematic structural diagram of an apparatus for matrix operations provided by the present application;
  • FIG. 6 is a schematic structural diagram of a matrix operation device provided by the present application.
  • FIG. 1 is a schematic diagram of the logical architecture of a system 10 suitable for matrix operations in this application.
  • the system 10 includes an accelerator 100 for matrix operations, a processor 200, a shared storage space 300, and a bus 400; the accelerator 100 for matrix operations and the processor 200 share the storage space in the main memory 300 through the bus 400.
  • the system 10 may specifically be a device with a matrix operation function; for example, the system 10 may be a computing device, such as a server.
  • the accelerator 100 for matrix operations and the processor 200 may be two independent chips, or may be two modules integrated in one chip, which is not limited in this application.
  • the processor 200 may be, for example, a central processing unit (central processing unit, CPU), a field-programmable gate array (field-programmable gate array, FPGA), an application specific integrated circuit (application specific integrated circuit, ASIC), Graphics processing unit (graphics processing unit, GPU), etc.
  • in this application, the processor 200 is described by taking the CPU 200 as an example.
  • the shared storage space 300 may be, for example, main memory or any other storage space that can be shared by the processor 200 and the accelerator 100 for matrix operations.
  • Matrix operation refers to the process of operating at least two matrices to obtain a result matrix.
  • matrix operations are widely used in fields such as large-scale scientific computing, large-scale engineering computing, and numerical simulation.
  • matrix operations are often optimized into efficient, portable linear algebra packages.
  • matrix operations mainly include matrix multiplication, matrix exponentiation, matrix division, and so on, and most matrix operations can be converted into matrix multiplication; therefore, the program corresponding to matrix multiplication, that is, general matrix multiplication (general matrix multiplication, GEMM), can be regarded as the core of the basic linear algebra subprograms (BLAS) linear algebra software package.
  • the matrix multiplication is introduced by taking the multiplication of matrix A and matrix B to obtain matrix C as an example.
  • the condition for multiplying matrix A by matrix B is that the number of columns included in matrix A is the same as the number of rows included in matrix B.
  • Each element of matrix C is obtained by multiplying the elements of one row of matrix A by the corresponding elements of one column of matrix B and accumulating the products. For example, the element in the i-th row and j-th column of matrix C is c_ij = Σ_{k=1}^{N} a_ik × b_kj, where N is the number of columns included in matrix A (which is also the number of rows included in matrix B), a_ik is the k-th element of the i-th row of matrix A, and b_kj is the j-th element of the k-th row of matrix B.
  • in this application, the process of calculating one element of matrix C is referred to simply as a multiply-add process.
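  • For concreteness, a small sketch of one multiply-add process in Python (illustrative only; matmul_element is an invented name):

```python
import numpy as np

def matmul_element(A, B, i, j):
    # One multiply-add process: c_ij = sum over k of a_ik * b_kj,
    # where k runs over the N columns of A (equal to the N rows of B).
    N = A.shape[1]
    return sum(A[i, k] * B[k, j] for k in range(N))

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
assert matmul_element(A, B, 0, 1) == (A @ B)[0, 1]   # 1*6 + 2*8 = 22
```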
  • the accelerator 100 for matrix operations is used to receive the matrix operation instruction sent by the CPU 200 and, based on the matrix operation instruction, perform the matrix operation on the matrices to be operated on that are stored in the main memory 300.
  • the accelerator 100 for matrix operations includes: a control (control, CTRL) unit 110, a memory 120, a processing unit (process element, PE) 131, PE 132, PE 133 and PE 134.
  • the accelerator 100 for matrix operations further includes a direct memory access (direct memory access, DMA) unit 140.
  • the CTRL unit 110 is used to receive the matrix operation instruction sent by the CPU 200, to block, based on the matrix operation instruction, the first matrix and the second matrix on which the matrix operation is to be performed, and to send a command to the DMA unit 140 based on the blocking result to instruct the DMA unit 140 to perform data access operations.
  • the DMA unit 140 is configured to acquire a subset of the first matrix from the main memory 300 and store it in the first storage space of the memory 120 according to the instruction of the CTRL unit 110 .
  • the CTRL unit 110 is further configured to send an operation instruction to each PE, and the PEs are used to obtain subsets of the first matrix and subsets of the second matrix from the first storage space and the second storage space, respectively, according to the operation instruction sent by the CTRL unit 110, perform matrix operations on the subsets of the first matrix and the subsets of the second matrix to obtain subsets of the third matrix, and store the subsets of the third matrix in the corresponding positions of the third storage space. After the matrix operations on all subsets of the first matrix and all subsets of the second matrix are completed in this way, the DMA unit 140 is further configured to read the third matrix from the third storage space of the memory 120 and store it in the main memory 300.
  • the multiple PEs are all connected to the memory 120 , and the multiple PEs are all controlled by the CTRL unit 110 .
  • for example, matrix A and matrix B can be divided into blocks as follows: the matrix composed of the elements in rows 0 to 3 of matrix A is denoted as subset A0, the matrix composed of the elements in rows 4 to 7 as subset A1, the matrix composed of the elements in rows 8 to 11 as subset A2, and the matrix composed of the elements in rows 12 to 15 as subset A3; similarly, the matrix composed of the elements in columns 0 to 3 of matrix B is denoted as subset B0, the matrix composed of the elements in columns 4 to 7 as subset B1, the matrix composed of the elements in columns 8 to 11 as subset B2, and the matrix composed of the elements in columns 12 to 15 as subset B3. In this way, matrix A (16 × 4) is divided into four 4 × 4 subsets A0-A3, and matrix B (4 × 16) is divided into four 4 × 4 subsets B0-B3.
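  • The partition in this example can be sketched as follows (illustrative NumPy slicing; subset names follow the text):

```python
import numpy as np

A = np.arange(16 * 4, dtype=float).reshape(16, 4)   # matrix A, 16x4
B = np.arange(4 * 16, dtype=float).reshape(4, 16)   # matrix B, 4x16

A_subsets = [A[4 * i:4 * i + 4, :] for i in range(4)]  # A0..A3: rows 0-3, 4-7, 8-11, 12-15
B_subsets = [B[:, 4 * j:4 * j + 4] for j in range(4)]  # B0..B3: cols 0-3, 4-7, 8-11, 12-15

assert all(s.shape == (4, 4) for s in A_subsets + B_subsets)
```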
  • each subset obtained by dividing a matrix must include consecutive elements of the matrix, any element of the matrix is included in only one subset, and all elements of the matrix are included in the subsets.
  • the memory 120 can be divided into three storage spaces: storage space A, storage space B, and storage space C, where storage space A is used to store subsets of matrix A, storage space B is used to store subsets of matrix B, and storage space C is used to store matrix C.
  • a subset is a set of some elements of a matrix, obtained, for example, by dividing the matrix into multiple square matrices.
  • FIG. 2 is a schematic diagram of the logical architecture of the computing modules involved in the PE 131 performing a multiply-add process.
  • assume that the multiply-add process performed by the PE 131 is the process of obtaining the first element of the first row of C00 based on the first row {a_00, a_01, a_02, a_03} of A0 and the first column {b_00, b_10, b_20, b_30} of B0. The calculation modules involved in this multiply-add process may then include: multiplier 1, multiplier 2, multiplier 3, multiplier 4, adder 1, adder 2, adder 3, adder 4, register 1, and register 2.
  • one input terminal of each of the multipliers 1-4 is connected to the storage units corresponding to a_00, a_01, a_02, and a_03 in the first storage block of the storage space A of the memory 120, respectively, and the other input terminal of each of the multipliers 1-4 is connected to the storage units corresponding to b_00, b_10, b_20, and b_30 in the first storage block of the storage space B of the memory 120, respectively. The output terminals of the multiplier 1 and the multiplier 2 are connected to the two input terminals of the adder 1, the output terminals of the multiplier 3 and the multiplier 4 are connected to the two input terminals of the adder 2, the output terminals of the adder 1 and the adder 2 are connected to the two input terminals of the adder 3, the output terminal of the adder 3 is connected to the input terminal of the register 1, the output terminal of the register 1 is connected to one input terminal of the adder 4, the other input terminal of the adder 4 is connected to the output terminal of the register 2, and the output terminal of the adder 4 is connected to the input terminal of the register 2 and to the storage unit corresponding to the first element of the first row of the subset C00 in the first storage block of the storage space C of the memory 120.
  • in this case, one multiply-add process performed by the PE 131 in step S11 may include:
  • the multiplier 1 reads a_00 and b_00 from the storage space A and the storage space B respectively, and calculates a_00 × b_00 to obtain C_0; the multiplier 2 reads a_01 and b_10 from the storage space A and the storage space B respectively, and calculates a_01 × b_10 to obtain C_1; the multiplier 3 reads a_02 and b_20 from the storage space A and the storage space B respectively, and calculates a_02 × b_20 to obtain C_2; the multiplier 4 reads a_03 and b_30 from the storage space A and the storage space B respectively, and calculates a_03 × b_30 to obtain C_3;
  • the adders 1-3 accumulate C_0, C_1, C_2, and C_3 to obtain C_123; the adder 4 then refreshes the current value C_current in the register 2 with C_123, and stores C_123 in the storage unit corresponding to the first element of the first row of the subset C00 in the first storage block of the storage space C.
  • the multiplier can be any circuit module with a multiplication function, and the adder can be any circuit module with an addition function; for both the circuit module corresponding to the multiplier and the circuit module corresponding to the adder, the number of inputs and outputs can be flexibly designed based on needs.
  • for example, the adders 1-3 may be replaced by one adder having four input terminals and one output terminal.
  • the above-mentioned register 1 and register 2 only function as data buffers in the PE 131 to improve the processing efficiency of the multiply-add process.
  • in other possible implementations, the PE 131 may include only the register 2; in that case, the output terminal of the adder 3 can be directly connected to one input terminal of the adder 4.
  • alternatively, the PE 131 may include no register; in that case, the output terminal of the adder 3 is directly connected to one input terminal of the adder 4, the other input terminal of the adder 4 is connected to the storage unit corresponding to the first element of the first row of the subset C00 in the first storage block of the storage space C, from which the current data of the storage unit is read, and the output terminal of the adder 4 is also connected to that storage unit, so that the current data of the storage unit is refreshed with the accumulated result.
  • the memory 120 may specifically be a volatile memory or a non-volatile memory; the non-volatile memory may be, for example, a read-only memory (read-only memory, ROM), a programmable read-only memory (programmable ROM, PROM), or a flash memory, and the volatile memory may be, for example, a random access memory (random access memory, RAM), which is not limited in this application.
  • it should be noted that the system architecture shown in FIG. 1 is only an example of a system architecture for the method for matrix operations provided by the present application, and the logical architecture, shown in FIG. 2, of the computing modules involved when the PE 131 performs one multiply-add process is only an example of the PE structure for the matrix operation method provided by the present application; neither constitutes a limitation on the embodiments of the present application.
  • the present application provides a method for matrix operation.
  • in the method, a processor sends a matrix operation instruction to an accelerator for matrix operations, instructing the accelerator to perform a matrix operation on the first matrix and the second matrix. The matrix operation accelerator then blocks the two matrices to obtain multiple first subsets of the first matrix and multiple second subsets of the second matrix, moves some or all of the first subsets and second subsets from the main memory into the first storage space and the second storage space of the memory of the matrix operation accelerator, performs matrix operations on the first subsets and the second subsets according to the matrix operation instruction, and stores the results of the matrix operations corresponding to the first subsets and the second subsets in the third storage space of the memory; the final data in the third storage space is the result matrix obtained by performing the matrix operation on the first matrix and the second matrix.
  • it can be seen that this method uses a dedicated accelerator to perform matrix operations. The accelerator has its own internal memory, so matrix operations are no longer limited by the register resources in the processor, and the number of data accesses between the accelerator and the main memory is reduced; the matrix operation accelerator itself computes the matrices participating in the operation, so the matrix operation is no longer limited by the computing capability of the processor, large-scale matrix operations can be completed in a short time, and efficient matrix operations are realized.
  • in this application, the memory 120 is divided into several storage spaces, and each storage space is used to store all or part of the data of one matrix in the matrix operation; a storage space is divided into several storage blocks, and each storage block is used to store one subset of the partitioned matrix; a storage block is divided into several storage units, and each storage unit is used to store one element of the matrix.
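  • Under this hierarchy, the address of a storage unit can be computed from the storage space base, the storage block index, and the element position; the following arithmetic is a hypothetical sketch (the block size and element width are assumptions, not from the patent):

```python
def unit_address(space_base, block_index, row, col, block_size=4, elem_bytes=4):
    # A storage space holds storage blocks; a block holds one size x size
    # subset; a unit holds one matrix element (assumed 4-byte elements).
    block_bytes = block_size * block_size * elem_bytes
    return (space_base
            + block_index * block_bytes
            + (row * block_size + col) * elem_bytes)

# Element (1, 3) of the subset held in block 2 of a space based at 0x0.
print(hex(unit_address(0x0, 2, 1, 3)))   # 0x9c
```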
  • the method for matrix operation provided by the present application is described in detail in conjunction with FIG. 3 .
  • the method includes:
  • the CPU 200 sends a matrix operation instruction to the CTRL unit 110 of the accelerator 100 for matrix operation, where the matrix operation instruction is used to instruct to perform matrix operation on the first matrix and the second matrix.
  • the matrix operation instruction in S301 may specifically be program code written by the CPU 200 into the program space of the main memory 300; the CTRL unit 110 obtains the program code from the program space of the main memory 300 and decodes it to obtain the matrix operation instruction.
  • the matrix operation instruction is used to instruct the accelerator 100 for matrix operation to perform matrix operation between the first matrix and the second matrix.
  • the matrix operation instruction can also indicate the relevant information of the matrices participating in the matrix operation, such as: the starting address and matrix size of each matrix participating in the matrix operation.
  • the matrix operation instruction may specifically include: indication information 1, starting address 1 of the first matrix, size 1 of the first matrix, starting address 2 of the second matrix, and size 2 of the second matrix, where indication information 1 is used to instruct that matrix multiplication be performed on the first matrix and the second matrix. For example, size 1 of the first matrix may be 16 × 4 and size 2 of the second matrix may be 4 × 16; starting address 1 is the starting address at which the first matrix (matrix A) is stored in the data space of the main memory 300, and starting address 2 is the starting address at which the second matrix (matrix B) is stored in the data space of the main memory 300.
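  • As a purely hypothetical encoding of these fields (the patent names the fields but not their layout; all identifiers here are invented):

```python
from dataclasses import dataclass

@dataclass
class MatrixOpInstruction:
    op: str            # indication information 1, e.g. matrix multiplication
    addr_a: int        # starting address 1 of the first matrix in main memory
    shape_a: tuple     # size 1 of the first matrix, e.g. (16, 4)
    addr_b: int        # starting address 2 of the second matrix
    shape_b: tuple     # size 2 of the second matrix, e.g. (4, 16)

instr = MatrixOpInstruction("matmul", 0x1000, (16, 4), 0x2000, (4, 16))
```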
  • the main memory 300 includes a data space and a program space, wherein the data space is used to store operands, and the program space is used to store program codes corresponding to various instructions.
  • the main memory 300 can reserve part of the program space for the accelerator 100 for matrix operations, and the CPU 200 can write the program code corresponding to the matrix operation instruction into the reserved program space to instruct the accelerator 100 to perform the corresponding matrix operation based on the matrix operation instruction.
  • the CTRL unit 110 divides the first matrix and the second matrix into blocks based on the matrix operation instruction to obtain multiple first subsets of the first matrix and multiple second subsets of the second matrix.
  • based on the matrix operation instruction, the CTRL unit 110 can determine that a matrix multiplication operation needs to be performed on the first matrix and the second matrix. In order to make full use of the resources in the accelerator 100 for matrix operations and realize efficient matrix operations, the CTRL unit 110 performs block processing on the two matrices participating in the operation. Each block obtained after the block processing is called a subset, and each subset includes at least one element.
  • performing block processing on a matrix specifically means dividing several elements in at least one consecutive row or consecutive column of the matrix into one subset.
  • each subset obtained by dividing a matrix must include consecutive elements of the matrix, any element of the matrix can be included in only one subset, and all elements of the matrix need to be included in some subset.
  • being multipliable specifically means that the number of columns included in a subset of the first matrix is the same as the number of rows included in the corresponding subset of the second matrix.
  • subsets obtained by matrix division may have the same scale or different scales, as long as it is ensured that the subsets of the two matrices after division can be multiplied.
  • if the remaining elements cannot form a subset of this size, the remaining elements can be padded with zeros to form a subset of this size; the process of performing the matrix operation is not affected by the zero-padding operation.
  • in this application, the case where the matrix is divided into square matrices (that is, each subset is a square matrix) and the subsets into which the two matrices participating in the operation are divided have the same scale is taken as an example for description.
  • the scale of the first matrix is 16 ⁇ 4, and the scale of the second matrix is 4 ⁇ 16.
  • then the ways in which the CTRL unit 110 may divide the first matrix and the second matrix include the following. Mode 1: the subset is a 1 × 1 square matrix; then 64 first subsets and 64 second subsets are obtained after blocking, and each subset includes 1 element. Mode 2: the subset is a 2 × 2 square matrix; then 16 first subsets and 16 second subsets are obtained after blocking, and each subset includes 4 consecutive elements. Mode 3: the subset is a 4 × 4 square matrix; then 4 first subsets and 4 second subsets are obtained after blocking, and each subset includes 16 consecutive elements.
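  • The subset counts for the three modes can be checked with a short sketch (matrix sizes from the text; the arithmetic only illustrates the counting):

```python
# First matrix 16x4, second matrix 4x16, divided into size x size square subsets.
for size in (1, 2, 4):
    n_first = (16 // size) * (4 // size)
    n_second = (4 // size) * (16 // size)
    print(f"{size}x{size} subsets: {n_first} first, {n_second} second, "
          f"{size * size} element(s) each")
# 1x1: 64 and 64; 2x2: 16 and 16; 4x4: 4 and 4
```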
  • the CTRL unit 110 sends a first command to the DMA unit 140, where the first command is used to instruct the DMA unit 140 to acquire the first subset of the first matrix and the second subset of the second matrix.
  • the DMA unit 140 acquires the first subset of the first matrix and the second subset of the second matrix from the main memory 300 .
  • the DMA unit 140 stores the first subset of the first matrix and the second subset of the second matrix into the first storage space and the second storage space of the memory 120, respectively.
  • the CTRL unit 110 may generate and send the first command to the DMA unit 140 based on the blocking result and the resources of the memory 120, instructing the DMA unit 140 to move N first subsets and N second subsets from the main memory 300, where N is an integer greater than or equal to the number of PEs included in the accelerator 100 for matrix operations; for the system 10, N ≥ 4.
  • generally, the value of N is an integer multiple of the number of PEs included in the accelerator 100 for matrix operations. For example, if the scale of the first subsets and the second subsets is 1 × 1, N can take 4n (n being an integer from 1 to 16); if the scale of the first subsets and the second subsets is 2 × 2, N can take 4m (m being an integer from 1 to 4); if the scale of the first subsets and the second subsets is 4 × 4, N can take 4.
  • the storage area of the memory 120 is divided into multiple storage spaces, and each storage space is used to store the data of one matrix. For example, if a matrix operation is performed on the first matrix and the second matrix, the memory 120 divides the storage area into three storage spaces: the first storage space, the second storage space, and the third storage space, where the first storage space is used to store some or all of the first subsets of the first matrix moved by the DMA unit 140, the second storage space is used to store some or all of the second subsets of the second matrix moved by the DMA unit 140, and the third storage space is used to store the intermediate results or final result (that is, the third matrix) obtained after each PE performs the matrix operation. In the initial state (that is, when the matrix operation has not yet been performed), the third storage space is empty.
  • after the DMA unit 140 receives the first command, it can obtain all or some of the first subsets and the second subsets from the main memory 300 based on the first command, and store the obtained first subsets and second subsets in the first storage space and the second storage space of the memory 120, respectively.
  • assume that the first storage space of the memory 120 includes A0-A3 and the second storage space includes B0-B3, where A0-A3 and B0-B3 are all 4 × 4 square matrices.
  • the CTRL unit 110 sends a second command to each PE, where the second command is used to instruct each PE to perform a corresponding matrix operation.
  • each PE obtains the first subset of the first matrix and the second subset of the second matrix from the first storage space and the second storage space of the memory, respectively, based on the second command.
  • each PE performs a matrix operation on the acquired first subset and the second subset in parallel based on the second command to obtain a third subset, and stores the third subset in the third storage space of the memory 120 .
  • each PE may determine, based on the second command sent by the CTRL unit 110, the storage blocks for which it is responsible for performing matrix multiplication, and perform matrix multiplication on the subsets in the determined storage blocks. It should be noted that, since the matrix operations performed by the PEs are parallel, the operations performed by each PE in the parallel matrix operations are the same; therefore, only the interaction of the PE 131 in the matrix operation is shown in FIG. 3.
  • below, the matrix operation of the PE 131 is taken as an example to illustrate the parallel operations performed by each PE in the matrix operation.
  • that the PE performs a matrix operation on the acquired first subset and second subset to obtain a third subset and stores the third subset in the third storage space of the memory 120 may specifically be: a block multiplication operation is performed on the first subset and the second subset, and the block multiplication result is taken as the third subset corresponding to the matrix multiplication of the first subset and the second subset and is stored in the corresponding position of the third storage space. For example, after A0 is multiplied by B0, the third subset C00 is obtained, and C00 is stored in the first storage block of the third storage space.
  • a block multiplication operation includes at least one multiplication and addition operation.
  • assume that the first storage space and the second storage space of the memory 120 are each divided into 4 storage blocks and the third storage space is divided into 16 storage blocks, with each storage block storing one subset. The first storage space includes A0-A3 and the second storage space includes B0-B3: storage block 0 to storage block 3 of the first storage space store A0-A3 respectively; storage block 4 to storage block 7 of the second storage space store B0-B3 respectively; and storage block 8 to storage block 23 of the third storage space store C00, C01, C02, C03, C10, C11, C12, C13, C20, C21, C22, C23, C30, C31, C32, and C33 respectively. In the initial state, C00-C33 are all equal to 0; that is, storage blocks 8 to 23 are all empty.
  • assume that the PE 131 corresponds to storage block 0 and storage blocks 8-11, the PE 132 corresponds to storage block 1 and storage blocks 12-15, the PE 133 corresponds to storage block 2 and storage blocks 16-19, and the PE 134 corresponds to storage block 3 and storage blocks 20-23.
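  • This correspondence between PEs and storage blocks can be written down as a small sketch (block indices from the text; pe_blocks is an invented helper):

```python
def pe_blocks(pe_index):
    # Blocks 0-3 hold A0-A3, blocks 4-7 hold B0-B3, blocks 8-23 hold C00..C33.
    a_block = pe_index                                           # A subset
    c_blocks = list(range(8 + 4 * pe_index, 12 + 4 * pe_index))  # four C blocks
    return a_block, c_blocks

for pe in range(4):
    a_blk, c_blks = pe_blocks(pe)
    print(f"PE {131 + pe}: A block {a_blk}, C blocks {c_blks}")
# PE 131: A block 0, C blocks [8, 9, 10, 11] ... PE 134: A block 3, C blocks [20, 21, 22, 23]
```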
  • for the PE 131, since the second storage space includes B0 to B3, four block multiplication operations need to be performed, and each block multiplication operation corresponds to one storage block of the second storage space.
  • the process by which PE 131 performs matrix operations may include:
  • S21: the PE 131 obtains A0 from storage block 0 and B0 from storage block 4, calculates A0 × B0 to obtain C00, and stores C00 in storage block 8 of the third storage space of the memory 120;
  • S22: the PE 131 obtains B1 from storage block 5, calculates A0 × B1 to obtain C01, and stores C01 in storage block 9 of the third storage space of the memory 120;
  • S23: the PE 131 obtains B2 from storage block 6, calculates A0 × B2 to obtain C02, and stores C02 in storage block 10 of the third storage space of the memory 120;
  • S24: the PE 131 obtains B3 from storage block 7, calculates A0 × B3 to obtain C03, and stores C03 in storage block 11 of the third storage space of the memory 120;
  • each of steps S21-S24 represents the process of one block multiplication operation performed by the PE 131.
  • in addition, FIG. 4 shows the process of the PE 131 performing the block multiplication operation corresponding to S21; the process of the PE 132 obtaining A1 from storage block 1 and B1 from storage block 5, calculating A1 × B1 to obtain C11, and storing C11 in storage block 13 of the third storage space of the memory 120; the process of the PE 133 obtaining A2 from storage block 2 and B2 from storage block 6, calculating A2 × B2 to obtain C22, and storing C22 in storage block 18 of the third storage space of the memory 120; and the process of the PE 134 obtaining A3 from storage block 3 and B3 from storage block 7, calculating A3 × B3 to obtain C33, and storing C33 in storage block 23 of the third storage space of the memory 120.
  • that each PE performs parallel matrix operations specifically means that, after each PE acquires its first subset from the storage block corresponding to it in the first storage space, it sequentially acquires each second subset from the storage blocks of the second storage space, performs a block multiplication operation of the first subset with each acquired second subset, and stores the obtained third subsets in the storage blocks corresponding to that PE in the third storage space.
  • the number of times each PE performs the block multiplication operation may be equal to the number of the second subsets participating in the matrix operation in S308.
  • after S308 is executed, several third subsets are stored in the third storage space. For example, for the operation of multiplying matrix A by matrix B, the number of third subsets is equal to the product of the number of first subsets and the number of second subsets performing the parallel matrix operations; as another example, for the operation of multiplying matrix B by matrix A, the number of third subsets is equal to 1, and each PE accumulates the third subset it calculates onto the current data in the third storage space to obtain the final matrix C, the four third subsets calculated by the PEs each being a 4 × 4 square matrix.
  • in the process of performing parallel matrix operations, the operations performed by each PE are independent and are not affected by the other PEs, and the speed at which one PE performs its matrix operations does not affect the other PEs. Performing parallel matrix operations on different subsets of the two matrices through multiple PEs can effectively improve the rate of matrix operations.
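  • An end-to-end sketch of this round of parallel block operations for the 16×4 by 4×16 example (illustrative only; storage blocks are modeled as NumPy arrays, and the PE loop is sequential here):

```python
import numpy as np

A = np.arange(16 * 4, dtype=float).reshape(16, 4)
B = np.arange(4 * 16, dtype=float).reshape(4, 16)

A_blk = [A[4 * i:4 * i + 4, :] for i in range(4)]      # storage blocks 0-3
B_blk = [B[:, 4 * j:4 * j + 4] for j in range(4)]      # storage blocks 4-7
C_blk = [[np.zeros((4, 4)) for _ in range(4)]          # storage blocks 8-23,
         for _ in range(4)]                            # all zero initially

for pe in range(4):            # PEs 131-134, each independent of the others
    for j in range(4):         # one block multiplication per B subset
        C_blk[pe][j] += A_blk[pe] @ B_blk[j]

C = np.block(C_blk)            # the third matrix assembled from C00..C33
assert np.allclose(C, A @ B)
```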
  • the CTRL unit 110 judges whether the matrix operation on the first matrix and the second matrix is completed; if not, the following S310 is executed, and if so, S311 is executed.
  • the CTRL unit 110 sends a third command to the DMA unit 140, where the third command is used to instruct the DMA unit 140 to obtain the first subset of the first matrix or the second subset of the second matrix that is not loaded, and returns to execute S304 .
  • the CTRL unit 110 writes a third matrix to the main memory 300 through the DMA unit 140, where the third matrix is a result matrix obtained by performing matrix operations on the first matrix and the second matrix.
  • after S308, the CTRL unit 110 confirms whether there are first subsets or second subsets that have not participated in the matrix operation. If so, it determines that the matrix operation on the first matrix and the second matrix has not been completed, and S310 is executed to continue the unfinished matrix operation process; if it confirms that there are no first subsets or second subsets that have not participated in the matrix operation, it determines that the matrix operation on the first matrix and the second matrix has been completed, and S311 can be performed to write the third matrix into the main memory 300 through the DMA unit 140, where the third matrix is the result matrix obtained by performing the matrix operation on the first matrix and the second matrix.
  • when it is determined that the matrix operation on the first matrix and the second matrix is not completed, the CTRL unit 110 sends a third command to the DMA unit 140, instructing the DMA unit 140 to continue to obtain from the main memory 300 the first subsets of the first matrix or the second subsets of the second matrix that have not been loaded, and S304-S308 are executed again until the matrix operation is completed.
  • assume that A0-A7 and B0-B7 are obtained after the two matrices are divided into blocks, where A0-A3 are the first subsets in the first column, A4-A7 are the first subsets in the second column, B0-B3 are the second subsets in the first row, and B4-B7 are the second subsets in the second row. The above S304-S308 perform parallel matrix operations on the first subsets in the first column and the second subsets in the first row to obtain the third subsets C00-C33.
  • then, S310 and S304-S308 can be executed three more times, each time accumulating the obtained third subsets onto the corresponding storage blocks of the current third storage space to obtain new third subsets; the set of all third subsets obtained after the three accumulations is denoted as the third matrix.
  • for example, the continued matrix operation process of the PE 131 may include:
  • the DMA unit 140 moves the second subsets B4-B7 in the main memory 300 to the second storage space based on the third command, and the CTRL unit 110 sends the second command to each PE, where the second command is used to instruct each PE to perform the corresponding matrix operations;
  • S32: PE 131 obtains the second subsets B4-B7 of the second matrix from the second storage space of the memory based on the second command;
  • S33: PE 131 calculates A0×B4 to obtain C00', and accumulates C00' onto storage block 8 of the third storage space; calculates A0×B5 to obtain C01', and accumulates C01' onto storage block 9; calculates A0×B6 to obtain C02', and accumulates C02' onto storage block 10; and calculates A0×B7 to obtain C03', and accumulates C03' onto storage block 11.
  • Then, after the judgment in S309, the matrix operation process can further include:
  • S34: the DMA unit 140 moves the first subsets A4-A7 from the main memory 300 to the first storage space based on the third command, and the CTRL unit 110 sends the second command to each PE, where the second command instructs each PE to perform the corresponding matrix operation;
  • S35: PE 131 obtains the first subsets A4-A7 of the first matrix from the first storage space of the memory based on the second command;
  • S36: PE 131 calculates A4×B4 to obtain C00", and accumulates C00" onto storage block 8 of the third storage space; calculates A4×B5 to obtain C01", and accumulates C01" onto storage block 9; calculates A4×B6 to obtain C02", and accumulates C02" onto storage block 10; and calculates A4×B7 to obtain C03", and accumulates C03" onto storage block 11.
  • Next, after the judgment in S309, the matrix operation process can further include:
  • S37: the DMA unit 140 moves the second subsets B0-B3 from the main memory 300 to the second storage space based on the third command, and the CTRL unit 110 sends the second command to each PE, where the second command instructs each PE to perform the corresponding matrix operation;
  • S38: PE 131 obtains the second subsets B0-B3 of the second matrix from the second storage space of the memory based on the second command;
  • S39: PE 131 calculates A4×B0 to obtain C00"', and accumulates C00"' onto storage block 8 of the third storage space; calculates A4×B1 to obtain C01"', and accumulates C01"' onto storage block 9; calculates A4×B2 to obtain C02"', and accumulates C02"' onto storage block 10; and calculates A4×B3 to obtain C03"', and accumulates C03"' onto storage block 11.
  • In this way, PE 131 completes the matrix operation of the first matrix and the second matrix and obtains the four third subsets of the first row of the third matrix, denoted C00, C01, C02 and C03 respectively, where:
  • C00 = A0×B0 + A0×B4 + A4×B4 + A4×B0, C01 = A0×B1 + A0×B5 + A4×B5 + A4×B1,
  • C02 = A0×B2 + A0×B6 + A4×B6 + A4×B2, C03 = A0×B3 + A0×B7 + A4×B7 + A4×B3.
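Note the load order in S31-S39: each round refreshes only one operand buffer (the first or the second storage space) and reuses the other, so every DMA transfer is amortized over a full round of block multiplies. The generator below is an illustrative sketch of that alternation, written specifically for the two-by-two block layout of this example; the function name and list encoding are assumptions.

```python
def refill_schedule(a_tiles, b_tiles):
    # a_tiles / b_tiles: tile groups resident in main memory, e.g.
    # [A0-A3, A4-A7] and [B0-B3, B4-B7]; only one buffer changes per round.
    resident_a, resident_b = a_tiles[0], b_tiles[0]
    yield resident_a, resident_b        # initial load (S304-S305)
    resident_b = b_tiles[1]             # refresh second storage space (S31)
    yield resident_a, resident_b
    resident_a = a_tiles[1]             # refresh first storage space (S34)
    yield resident_a, resident_b
    resident_b = b_tiles[0]             # swing back to B0-B3 (S37)
    yield resident_a, resident_b

rounds = list(refill_schedule(["A0-A3", "A4-A7"], ["B0-B3", "B4-B7"]))
# [('A0-A3','B0-B3'), ('A0-A3','B4-B7'), ('A4-A7','B4-B7'), ('A4-A7','B0-B3')]
```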
  • The matrix operation process performed by the other PEs is similar to that of PE 131; for details, refer to the above description of the matrix operation process of PE 131, which is not repeated here.
  • In other possible implementations, when it is determined that the matrix operation on the first matrix and the second matrix is complete, the CTRL unit 110 sends a fourth command to the DMA unit 140, instructing the DMA unit 140 to store the obtained third matrix into the main memory 300. Specifically, after receiving the fourth command sent by the CTRL unit 110, the DMA unit 140 acquires the third matrix from the third storage space of the memory 120 and stores it in the main memory 300. In addition, when it is determined that the matrix operation on the first matrix and the second matrix is complete, the CTRL unit 110 may also send an interrupt instruction to the CPU 200, where the interrupt instruction lets the CPU 200 know that the matrix operation accelerator 100 has completed the operation on the first matrix and the second matrix indicated by the matrix operation instruction.
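Putting S304-S311 together, the CTRL unit's outer loop can be summarized as below. This is a driver-style sketch under stated assumptions: the dma, pes and cpu objects and their method names are invented for illustration and are not the patent's command set.

```python
from collections import deque

def ctrl_run(pending_loads, dma, pes, cpu):
    queue = deque(pending_loads)        # subsets not yet loaded
    while True:
        pes.run_parallel()              # S306-S308: block multiply + accumulate
        if not queue:                   # S309: every subset has participated
            break
        dma.load(queue.popleft())       # S310: third command, fetch next subsets
    dma.write_back_third_matrix()       # S311/fourth command: result to main memory
    cpu.interrupt("matrix operation complete")  # notify the CPU 200
```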
  • It can be seen that, with the method provided by this embodiment, the matrix operation accelerator, based on the processor's instruction, divides the matrices participating in the operation in the main memory into blocks to obtain multiple subsets of those matrices, loads some or all of the subsets from the shared storage space into different storage spaces of the accelerator's memory, performs parallel matrix operations on the subsets in the different storage spaces according to the matrix operation instruction sent by the processor, and stores the operation results in another storage space of the memory; the final data in that storage space is the result matrix obtained by performing the matrix operation on the first matrix and the second matrix.
  • In this way, the matrix operation is performed by a dedicated matrix operation accelerator with its own internal memory, so the matrix operation is no longer limited by the register resources in the processor; the number of data accesses between the matrix operation accelerator and the main memory is reduced, access time is saved, and the efficiency of the matrix operation is improved. Moreover, the matrix operation accelerator computes the participating matrices in parallel, so the matrix operation is no longer limited by the computing power of the processor, and large-scale matrix operations can be completed in a short time, achieving efficient matrix operation.
  • The matrix operation accelerator provided by this application is described in detail above with reference to FIG. 1 and FIG. 2, and the matrix operation method provided by this application with reference to FIG. 3 and FIG. 4. The apparatus and device for matrix operation provided by this application are described below with reference to FIG. 5 and FIG. 6.
  • FIG. 5 shows a matrix operation apparatus 500 provided by this application. The matrix operation apparatus 500 is applied to an accelerator for matrix operation, and includes: a receiving unit 501, a storage unit 502 and an operation unit 503;
  • the receiving unit 501 is used for receiving a matrix operation instruction, and the matrix operation instruction is used for instructing to perform matrix operation on the first matrix and the second matrix;
  • the storage unit 502 is configured to store the subset of the first matrix and the subset of the second matrix in the first storage space and the second storage space of the memory respectively, and to store a third matrix in the third storage space of the memory, where the third matrix is a matrix composed of subsets obtained by multiplying subsets of the first matrix by subsets of the second matrix;
  • the operation unit 503 is configured to perform a matrix operation on the subset of the first matrix and the subset of the second matrix according to the matrix operation instruction to obtain a result of the matrix operation.
  • the operation unit 503 is specifically configured to perform parallel matrix operations on the subset of the first matrix and the subset of the second matrix according to the matrix operation instruction to obtain a result of the matrix operation.
  • the apparatus 500 for matrix operation may further include: an update unit;
  • the update unit is configured to update the subset of the third matrix in the third storage space based on the result of the matrix operation, where the subset of the third matrix is obtained by performing a matrix operation on a subset of the first matrix and a subset of the second matrix.
  • the apparatus 500 for matrix operation may further include: a block unit;
  • the block unit is configured to block the first matrix and the second matrix based on the matrix operation instruction to obtain multiple first subsets of the first matrix and multiple second subsets of the second matrix.
  • the apparatus 500 for matrix operation may further include: a data access unit;
  • the data access unit is configured to obtain N first subsets of the first matrix and N second subsets of the second matrix from the shared storage space according to the blocking result, where N is greater than or equal to the number of processing units PE included in the accelerator for matrix operation, N is a positive integer, and the shared storage space is the storage space shared by the processor and the accelerator for matrix operation;
  • the above storage unit 502 is specifically configured to: store the N first subsets in the first storage space of the memory; and store the N second subsets in the second storage space of the memory.
  • the data access unit is further configured to: when the matrix operation on the first subsets in the first storage space and the second subsets in the second storage space is complete but the matrix operation has not been completed for all subsets of the first matrix and the second matrix, obtain from the shared storage space a first subset of the first matrix that has not participated in the matrix operation, and store the obtained first subset in the first storage space of the memory.
  • the data access unit is further configured to: when the matrix operation on the first subsets in the first storage space and the second subsets in the second storage space is complete but the matrix operation has not been completed for all subsets of the first matrix and the second matrix, obtain from the shared storage space a second subset of the second matrix that has not participated in the matrix operation, and store the obtained second subset in the second storage space of the memory.
  • the data access unit is further configured to: when the matrix operation is completed for all subsets of the first matrix and the second matrix, take the third matrix currently saved in the third storage space out of the memory and store it in the shared storage space, where the third matrix is the matrix obtained by performing the matrix operation on the first matrix and the second matrix.
  • the apparatus 500 for matrix operation may further include: a sending unit;
  • the sending unit is used for sending an interrupt instruction to the processor, where the interrupt instruction is used for informing the completion of the matrix operation on the first matrix and the second matrix.
  • Optionally, the matrix operation accelerator to which the apparatus is applied may include a processing unit PE, where the PE includes a multiplier and an adder; the first input and the second input of the multiplier are connected to the first storage space and the second storage space of the memory respectively, the output of the multiplier is connected to the first input of the adder, the second input of the adder is connected to the third storage space of the memory, and the output of the adder is connected to the third storage space of the memory.
  • In that case, the process of performing the matrix operation in the PE may include: the multiplier multiplies elements in the subset of the first matrix by elements in the subset of the second matrix; the adder adds the calculation results of the multiple multipliers to the elements in the subset of the current third matrix in the third storage space, and the elements in the subset of the third matrix in the third storage space are updated with the result of the addition.
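A functional model of this adder/multiplier datapath (not RTL; the function and variable names are assumptions) is:

```python
def pe_multiply_accumulate(a_row, b_col, third_storage, idx):
    # Each multiplier takes one element pair from the first and second
    # storage spaces; the adder sums the products with the element currently
    # in the third storage space and writes the result back to the same slot.
    products = [x * y for x, y in zip(a_row, b_col)]
    third_storage[idx] = third_storage[idx] + sum(products)

third = {"C00[0][0]": 0}
pe_multiply_accumulate([1, 2, 3, 4], [5, 6, 7, 8], third, "C00[0][0]")
assert third["C00[0][0]"] == 1*5 + 2*6 + 3*7 + 4*8  # 70
```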
  • Optionally, the matrix operation accelerator to which the apparatus is applied may include a processing unit PE, where the PE includes a multiplier, an adder and a register; the first input and the second input of the multiplier are connected to the first storage space and the second storage space of the memory respectively, the output of the multiplier and the output of the register are both connected to inputs of the adder, the output of the adder is connected to the input of the register, and the output of the adder is also connected to the third storage space of the memory.
  • In that case, the process of performing the matrix operation in the PE may include: the register stores the elements in the subset of the current third matrix in the third storage space; the multiplier multiplies elements in the subset of the first matrix by elements in the subset of the second matrix; the adder correspondingly adds the calculation results of the multiple multipliers to the elements in the subset of the current third matrix, and the elements in the subset of the third matrix in the third storage space are updated with the result of the addition.
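The register variant differs only in buffering: the running partial result stays in a PE-local register across several multiply-add passes, so the third storage space is read once and written once instead of on every pass. A sketch under that assumed buffering policy:

```python
def pe_with_register(segment_pairs, third_storage, idx):
    reg = third_storage[idx]                 # one read into the register
    for a_seg, b_seg in segment_pairs:       # several multiply-add passes
        reg += sum(x * y for x, y in zip(a_seg, b_seg))
    third_storage[idx] = reg                 # one write-back

third = {0: 0}
pe_with_register([([1, 2], [3, 4]), ([5, 6], [7, 8])], third, 0)
assert third[0] == (1*3 + 2*4) + (5*7 + 6*8)  # 11 + 83 = 94
```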
  • the number of multipliers included in the PE is related to the size of the subset of the first matrix and the size of the subset of the second matrix.
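As a back-of-envelope check of that sizing rule: with n×n blocks and n multipliers per PE, one block multiply takes n·n multiply-add passes, one per output element, each pass using all n multipliers at once. A tiny sketch (names are illustrative):

```python
def block_multiply_cost(n):
    multipliers_per_pe = n  # sizing rule stated above: n x n blocks -> n multipliers
    passes = n * n          # one multiply-add pass per element of the output block
    return multipliers_per_pe, passes, passes * n  # total scalar multiplies

print(block_multiply_cost(4))   # (4, 16, 64): the 4x4 example above
print(block_multiply_cost(8))   # (8, 64, 512): the 8x8 example
```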
  • the apparatus 500 in the embodiment of the present application may be implemented by an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), and the above-mentioned PLD may be a complex program logic device (complex programmable logical device, CPLD), field-programmable gate array (field-programmable gate array, FPGA), general array logic (generic array logic, GAL) or any combination thereof.
  • When the matrix operation method shown in FIG. 3 is implemented by software, the apparatus 500 and its respective modules may also be software modules.
  • The apparatus 500 for matrix operation according to this embodiment may correspond to executing the methods described in the embodiments of this application, and the above and other operations and/or functions of the respective units in the apparatus 500 are intended to implement the corresponding procedures of the method in FIG. 3; for brevity, they are not repeated here.
  • FIG. 6 is a schematic diagram of a matrix operation device 600 provided by this application; as shown in the figure, the matrix operation device 600 includes a processor 601, a memory 602, a communication interface 603 and a memory unit 604.
  • the processor 601, the memory 602, the communication interface 603, and the memory unit 604 communicate through the bus 605, and can also communicate through other means such as wireless transmission.
  • the memory 602 is used for storing instructions, and the processor 601 is used for executing the instructions stored in the memory 602 .
  • the memory 602 stores program codes, and the processor 601 can call the program codes stored in the memory 602 to perform the following operations:
  • receiving a matrix operation instruction, where the matrix operation instruction instructs a matrix operation to be performed on a first matrix and a second matrix;
  • storing the subset of the first matrix and the subset of the second matrix in the first storage space and the second storage space of the memory respectively, and storing a third matrix in the third storage space of the memory, where the third matrix is a matrix composed of subsets obtained by multiplying subsets of the first matrix by subsets of the second matrix;
  • Matrix operation is performed on the subset of the first matrix and the subset of the second matrix according to the matrix operation instruction to obtain a result of the matrix operation.
  • the processor 601 may be a CPU, and the processor 601 may also be other general-purpose processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the memory 602 may include read only memory and random access memory, and provides instructions and data to the processor 601 .
  • Memory 602 may also include non-volatile random access memory.
  • memory 602 may also store device type information.
  • the memory 602 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • By way of example but not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM) and direct rambus random access memory (direct rambus RAM, DR RAM).
  • In addition to the data bus, the bus 605 may also include a power bus, a control bus, a status signal bus, and the like. However, for clarity, the various buses are all labeled as bus 605 in the figure.
  • It should be understood that the device 600 for matrix operation may correspond to the apparatus 500 for matrix operation in the embodiments of this application, and may correspond to the entity that executes the method shown in FIG. 3 in the embodiments of this application; moreover, the above and other operations and/or functions of each module in the device 600 are intended to implement the corresponding procedures of the methods in FIG. 3, and for brevity are not repeated here.
  • As another possible embodiment, this application also provides a device that includes a processor, a shared storage space, and the accelerator for matrix operations shown in FIG. 1, the processor and the accelerator sharing the shared storage space, where: the processor is configured to send a matrix operation instruction to the accelerator; and the accelerator is configured to perform, based on the matrix operation instruction, the operation steps of the method shown in FIG. 3 on the matrices in the shared storage space to implement the matrix operation, which is not repeated here for brevity.
  • All or part of the above embodiments may be implemented by software, hardware, firmware or any combination thereof.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that contains one or more sets of available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media.
  • the semiconductor medium may be a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

A matrix operation method, applied to an accelerator for performing matrix operations. In response to a received matrix operation instruction, the matrix operation accelerator (100) stores a subset of a first matrix and a subset of a second matrix in a first storage space and a second storage space of a memory (120) respectively, and stores, in a third storage space of the memory (120), the subsets obtained by multiplying subsets of the first matrix by subsets of the second matrix; it then performs a matrix operation on the subsets of the first matrix and the second matrix according to the matrix operation instruction to obtain the result of the matrix operation. Performing matrix operations with a dedicated matrix operation accelerator (100) allows large-scale matrix operations to be completed in a short time, offloads the processor's matrix operation burden, frees matrix operations from the limits of the processor's register resources and of the processor's own computing power, and effectively improves the efficiency of matrix operations.

Description

矩阵运算的方法和加速器 技术领域
本申请涉及计算机领域,尤其涉及一种矩阵运算的方法和加速器。
背景技术
矩阵运算的过程通常为:首先,处理器从主存储器(英文:main memory,下文中简称为主存)中将待进行矩阵运算的数据载入寄存器中;接着,由处理器对该寄存器中的数据进行矩阵运算后,得到矩阵运算的结果。可见,该矩阵运算依赖处理器的计算能力以及处理器中寄存器的资源。随着信息的爆炸式增长,参与矩阵运算的矩阵的规模的不断增大,由于处理器的计算能力以及处理器中寄存器的资源均有限,无法对规模较大的矩阵进行高效的矩阵运算。因此,如何提供一种高效的矩阵运算方法成为亟待解决的技术问题。
发明内容
本申请提供了一种矩阵运算的方法和加速器,使得矩阵运算不受限于处理器的计算能力以及处理器中寄存器的资源,能够高效的完成矩阵运算。
第一方面,本申请提供了一种矩阵运算的加速器,该加速器至少包括:控制(control,CTRL)单元、存储器和处理单元(process element,PE)。其中,CTRL单元用于接收矩阵运算指令;存储器用于将存储区域划分为多个存储空间,例如划分为第一存储空间、第二存储块和第三存储空间,那么,存储器用于在第一存储空间存储第一矩阵的子集、在第二存储空间存储第二矩阵的子集,在第三存储空间存储第三矩阵,该第一矩阵和第二矩阵为矩阵运算指令所指示的参与矩阵运算的两个矩阵,第三矩阵为基于第一矩阵的子集和第二矩阵的子集相乘后获得的子集组成的矩阵;PE负责基于所述矩阵运算指令对第一存储空间中第一矩阵的子集和第二存储空间中第二矩阵的子集进行矩阵运算,得到矩阵运算的结果。这样,利用专门的矩阵运算的加速器进行矩阵运算,能够在较短的时间内完成大规模矩阵的运算,卸载了处理器的矩阵运算的负担,使得矩阵运算不再受限于处理器中寄存器的资源以及处理器自身的计算能力,有效的提高了矩阵运算的效率。
在一种可能的实现方式中,该矩阵运算的加速器中包括至少一个PE。作为一个示例,当矩阵运算的加速器中包括多个PE时,该多个PE可以分别用于基于矩阵运算指令对第一存储空间中第一矩阵的子集和第二存储空间中第二矩阵的子集进行并行矩阵运算,得到矩阵运算的结果。这样,多个PE并行执行矩阵运算,使得矩阵运算的速度不再依赖某个PE的计算速度,即使对于大规模矩阵,该矩阵运算的加速器能够快速完成运算,大大的提高了矩阵运算的效率。
在另一种可能的实现方式中,该矩阵运算的加速器中的PE,还可以基于矩阵运算的结果更新第三存储空间中第三矩阵的子集,该第三矩阵的子集为第一矩阵的子集与第二矩阵的子集进行矩阵运算后获得的。例如:假设当前第三存储空间中的第三矩阵的子集为第一矩阵A的子集A0与第二矩阵B的子集B0相乘后的子集C0;而PE将第一矩阵A的子集A0与第二矩阵B的子集B1相乘后得到矩阵运算的结果C1,那么,PE基于矩阵运算的结果C1更新第三存储空间中第三矩阵的子集具体可以是:将C1累加到当前第三存储空间中的第三矩阵的子集C0上,更新后第三存储空间中第三矩阵的该子集为C0+C1。又例如:仍然假设当前第 三存储空间中的第三矩阵的子集为第一矩阵A的子集A0与第二矩阵B的子集B0相乘后的子集C0;而PE将第一矩阵A的子集A0与第二矩阵B的子集B1相乘后得到矩阵运算的结果C1,并计算(C0+C1)=C2,记作矩阵运算的结果,那么,PE基于矩阵运算的结果C2更新第三存储空间中第三矩阵的子集具体可以是:用C2替换当前第三存储空间中的第三矩阵的子集C0,更新后第三存储空间中第三矩阵的该子集为C2。可以理解的是,每个PE可以基于CTRL单元的在指示确定其所负责进行矩阵运算的子集,并确定其得到的矩阵运算的结果(可以是中间结构也可以是最终组成第三矩阵的结果)保存在第三存储空间中的位置。
在另一种可能的实现方式中,该矩阵运算的加速器中的CTRL单元在接收到矩阵运算指令后,在进行矩阵运算之前,还可以基于该矩阵运算指令,对第一矩阵和第二矩阵进行分块,得到第一矩阵的多个子集和第二矩阵的多个子集。其中,子集可以是由矩阵中至少一个连续行或连续列中的若干元素组成的。划分矩阵所得的每个子集必须包括该矩阵中连续的元素,矩阵中的任意一个元素仅能够被包含在一个子集中,且矩阵中的所有元素均需要被包含在一个子集。CTRL单元对矩阵进行划分,得到的子集的规模可以相同也可以不同,但需要确保分块后的第一矩阵的子集和第二矩阵的子集是可乘的,可乘具体指:第一矩阵的子集包括的列数和第二矩阵的子集包括的行数相同。作为一个示例,可以以将矩阵从按照从左到右从上到下的方式划分为预设规模的方阵,即,得到的矩阵的子集均为行数和列数相同的方阵。这样,通过CTRL单元对待进行运算的矩阵进行分块,使得该矩阵运算的加速器能够对分块后矩阵的子集进行分块运算,而且对于多个PE的矩阵运算加速器而言,为实现多个PE的并行矩阵运算提供了数据基础,使得快速、高效的矩阵运算成为可能。
在另一种可能的实现方式中,该矩阵运算的加速器还可以包括直接内存存取(direct memory access,DMA)单元,该DMA单元用于实现该矩阵运算的加速器进行矩阵运算时数据的存取操作。具体而言,该DMA单元可以根据CTRL单元分块的结果,从共享存储空间中获取第一矩阵的N个第一子集和第二矩阵的N个第二子集,并将N个第一子集和N个第二子集分别存入存储器的第一存储空间和第二存储空间,其中,N大于或等于该矩阵运算的加速器包括的PE的数量,N为正整数。其中,共享存储空间为处理器和该矩阵运算的加速器共享的存储空间,该共享存储空间例如可以是主存。需要说明的是,N的取值通常与该矩阵运算的加速器中存储器的大小相关,如果存储器的空间足够大,则,该N可以取第一矩阵包括子集的数量或第二矩阵包括的子集的数量;如果存储器的空间有限,则,该N可以取该矩阵运算的加速器所包括PE的数量的倍数。这样,在矩阵运算的加速器内具有独立的存储器且具有能够从共享存储空间灵活存取数据的DMA单元,减少矩阵运算的加速器与共享存储空间之间存取数据的次数,节约存取数据的时间,提高了矩阵运算的效率。
在另一种可能的实现方式中,当PE完成对第一存储空间中的第一子集和第二存储空间中的第二子集的矩阵运算,且未对第一矩阵和第二矩阵中的所有子集完成矩阵运算时,作为一个示例,DMA单元还可以从共享存储空间中获取第一矩阵未参与矩阵运算的第一子集,并将所获取的第一矩阵未参与矩阵运算的第一子集存入存储器的第一存储空间。或者,作为另一个示例,DMA还可以从共享存储空间中获取第二矩阵未参与矩阵运算的第二子集,并将所获取的第二矩阵未参与矩阵运算的第二子集存入存储器的第二存储空间。这样,能够确保矩阵运算的数据被有序的从共享存储空间载入存储器对应的存储空间,使得有序且有效进行分块矩阵运算成为可能,实现了高效的矩阵运算。
在另一种可能的实现方式中,当PE对第一矩阵和第二矩阵中的所有子集完成矩阵运算时,DMA单元还可以将第三存储空间中当前保存的第三矩阵从存储器取出,存入共享存储空 间中,第三矩阵为第一矩阵和第二矩阵进行矩阵运算得到的矩阵。这样,能够将矩阵运算的最终结果从该矩阵运算的加速器中输出到共享存储空间,从而方便处理器从该共享存储空间直接读取该矩阵运算的最终结果。
作为一个示例,当矩阵运算的加速器完成对第一矩阵和第二矩阵的矩阵运算时,CTRL单元还可以向处理器发送中断指令,该中断指令用于告知处理器对第一矩阵和第二矩阵的矩阵运算已经完成,这样,处理器即可从共享存储空间中获取该矩阵运算的最终结果,为后续计算、分析等提供了可靠的数据基础。
在另一种可能的实现方式中,矩阵运算的加速器中的PE例如可以包括乘法器和加法器,乘法器的第一输入端和第二输入端分别连接存储器的第一存储空间和第二存储空间,乘法器的输出端连接加法器的第一输入端,加法器的第二输入端连接存储器的第三存储空间,加法器的输出端连接存储器的所述第三存储空间。其中,乘法器可以对第一矩阵的子集中的元素和第二矩阵的子集中的元素相乘;加法器可以对多个乘法器的计算结果、第三存储空间中当前第三矩阵的子集中的元素相加,并利用加法运算的结果更新第三存储空间中第三矩阵的子集中的元素。这样,通过PE的上述结构,实现对第一矩阵的子集和第二矩阵的子集相乘,使得该矩阵运算的加速器能够准确、高效的完成矩阵运算。
在另一种可能的实现方式中,矩阵运算的加速器中的PE例如可以包括乘法器、加法器和寄存器,乘法器的第一输入端和第二输入端分别连接存储器的第一存储空间和第二存储空间,乘法器的输出端和寄存器的输出端均连接加法器的输入端,加法器的输出端连接寄存器的输入端,加法器的输出端还连接存储器的第三存储空间。其中,寄存器可以存储第三存储空间中当前的第三矩阵的子集中的元素;乘法器可以对第一矩阵的子集中的元素和第二矩阵的子集中的元素相乘;加法器可以对多个乘法器的计算结果、寄存器中当前的第三矩阵的子集中的元素相加,并利用加法运算的结果更新第三存储空间中第三矩阵的子集中的元素。这样,通过PE的上述结构,实现对第一矩阵的子集和第二矩阵的子集相乘,使得该矩阵运算的加速器能够准确、高效的完成矩阵运算。需要说明的是,该实现方式中的寄存器在PE中仅起到数据缓存的作用,用以减少矩阵运算过程中PE从存储器中存取数据的次数,从而提高矩阵运算的处理效率。
需要说明的是,PE中包括的乘法器的数量和第一矩阵的子集的规模、第二矩阵的子集的规模相关。例如:第一矩阵的子集的规模和第二矩阵的子集的规模均为4×4,那么,PE中可以设置4个乘法器;又例如:第一矩阵的子集的规模和第二矩阵的子集的规模均为8×8,那么,PE中可以设置8个乘法器。
第二方面,本申请还提供了一种矩阵运算的方法,该方法应用于矩阵运算的加速器,矩阵运算的加速器用于执行矩阵运算,该方法具体可以包括:响应于所接收的矩阵运算指令,将第一矩阵的子集和第二矩阵的子集分别存储在存储器的第一存储空间和第二存储空间,将第一矩阵的子集和第二矩阵的子集相乘后获得的子集存储在存储器的第三存储空间,其中,该矩阵运算指令用于指示对第一矩阵和第二矩阵进行矩阵运算,第三存储空间中用于存储基于第一矩阵的子集和第二矩阵的子集相乘后获得的子集组成的第三矩阵;接着,根据矩阵运算指令对第一矩阵的子集和第二矩阵的子集进行矩阵运算,得到矩阵运算的结果。
在一种可能的实现方式中,上述根据矩阵运算指令对第一矩阵的子集和第二矩阵的子集进行矩阵运算,例如可以包括:根据矩阵运算指令,对第一矩阵的子集和第二矩阵的子集进行并行矩阵运算。
在另一种可能的实现方式中,本申请提供的方法还可以包括:基于矩阵运算的结果更新 第三存储空间中第三矩阵的子集,第三矩阵的子集为第一矩阵的子集与第二矩阵的子集进行矩阵运算后获得的。
在另一种可能的实现方式中,本申请实施例提供的方法还可以包括:基于矩阵运算指令,对第一矩阵和第二矩阵进行分块,得到第一矩阵的多个第一子集和第二矩阵的多个第二子集。
在另一种可能的实现方式中,本申请实施例提供的方法还可以包括:根据分块的结果,从共享存储空间中获取第一矩阵的N个第一子集和第二矩阵的N个第二子集,N大于或等于矩阵运算的加速器所包括的处理单元PE的数量,N为正整数,共享存储空间为处理器和矩阵运算的加速器共享的存储空间;那么,上述将第一矩阵的子集和第二矩阵的子集分别存储在存储器的第一存储空间和第二存储空间,例如可以包括:将N个第一子集存入存储器的第一存储空间;将N个第二子集存入存储器的第二存储空间。
在另一种可能的实现方式中,在完成对第一存储空间中的第一子集和第二存储空间中的第二子集的矩阵运算,且未对第一矩阵和第二矩阵中的所有子集完成矩阵运算时,本申请实施例提供的方法还可以包括:从共享存储空间中获取第一矩阵未参与矩阵运算的第一子集,并将所获取的第一矩阵未参与矩阵运算的第一子集存入存储器的第一存储空间。
在另一种可能的实现方式中,在完成对第一存储空间中的第一子集和第二存储空间中的第二子集的矩阵运算,且未对第一矩阵和第二矩阵中的所有子集完成矩阵运算时,本申请实施例提供的方法还可以包括:从共享存储空间中获取第二矩阵未参与矩阵运算的第二子集,并将所获取的第二矩阵未参与矩阵运算的第二子集存入存储器的第二存储空间。
在另一种可能的实现方式中,在对第一矩阵和第二矩阵中的所有子集完成矩阵运算时,本申请实施例提供的方法还可以包括:将第三存储空间中当前保存的第三矩阵从存储器取出,存入共享存储空间中,第三矩阵为第一矩阵和第二矩阵进行矩阵运算得到的矩阵。
在另一种可能的实现方式中,本申请实施例提供的方法还可以包括:向处理器发送中断指令,中断指令用于告知完成对第一矩阵和第二矩阵的矩阵运算。
在另一种可能的实现方式中,实施该方法的矩阵运算的加速器中可以包括:处理单元PE,PE包括乘法器和加法器,其中,乘法器的第一输入端和第二输入端分别连接存储器的第一存储空间和第二存储空间,乘法器的输出端连接加法器的第一输入端,加法器的第二输入端连接存储器的第三存储空间,加法器的输出端连接存储器的第三存储空间。那么,PE中进行矩阵运算的过程可以包括:乘法器对第一矩阵的子集中的元素和第二矩阵的子集中的元素相乘,加法器对多个乘法器的计算结果、第三存储空间中当前第三矩阵的子集中的元素相加,并利用加法运算的结果更新第三存储空间中第三矩阵的子集中的元素。
在另一种可能的实现方式中,实施该方法的矩阵运算的加速器中可以包括:处理单元PE,PE包括乘法器、加法器和寄存器,乘法器的第一输入端和第二输入端分别连接存储器的第一存储空间和第二存储空间,乘法器的输出端和寄存器的输出端均连接加法器的输入端,加法器的输出端连接寄存器的输入端,加法器的输出端还连接存储器的第三存储空间。那么,PE中进行矩阵运算的过程可以包括:寄存器存储第三存储空间中当前的第三矩阵到的子集中的元素;乘法器对第一矩阵的子集中的元素和第二矩阵的子集中的元素相乘;加法器对多个乘法器的计算结果、第三存储空间中当前的第三矩阵的子集中的元素对应相加,并利用加法运算的结果更新第三存储空间中第三矩阵的子集中的元素。
在另一种可能的实现方式中,PE中包括的乘法器的数量和第一矩阵的子集的规模、第二矩阵的子集的规模相关。
需要说明的是,第二方面提供的方法由第一方面提供的矩阵运算的加速器实施,该方法 中的各种可能的实施方式的相关说明和达到的效果,均可以参见上述第一方面中对应的描述。
第三方面,本申请还提供了一种矩阵运算的装置,所述装置包括用于执行第二方面或第二方面任一种可能实现方式中的矩阵运算的方法的各个模块。
第四方面,本申请还提供了一种矩阵运算的设备,矩阵运算的设备包括处理器和存储器;存储器,用于存储计算机指令;处理器,用于根据计算机指令执行如第二方面或第二方面任一种可能实现方式中的矩阵运算的方法的操作步骤。
第五方面,本申请还提供了一种设备,该设备包括处理器、共享存储空间和上述第一方面或第一方面任一种可能实现方式中提供的矩阵运算的加速器,处理器和矩阵运算的加速器共享该共享存储空间,其中:处理器,用于向矩阵运算的加速器发送矩阵运算指令;矩阵运算的加速器,用于基于矩阵运算指令,对共享存储空间中的矩阵执行上述第二方面或第二方面任一种可能实现方式中提供的方法,实现矩阵运算。
第六方面,本申请提供一种计算机可读存储介质,计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述各方面的方法的操作步骤。
第七方面,本申请提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述各方面的方法的操作步骤。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
图1为本申请适用矩阵运算的系统10的逻辑架构示意图;
图2为本申请提供的PE 131执行一次乘加过程所涉及的计算模块的逻辑架构示意图;
图3为本申请提供的一种矩阵运算的方法的流程交互图;
图4为本申请提供的各PE执行一次块乘操作的示意图;
图5为本申请提供的一种矩阵运算的装置的结构示意图;
图6为本申请提供的一种矩阵运算的设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。
图1为本申请中适用矩阵运算的系统10的逻辑架构示意图,如图1所示,该系统10包括:矩阵运算的加速器100、处理器200、共享存储空间300和总线400,矩阵运算的加速器100和处理器200通过总线400共享主存300中的存储空间。其中,系统10具体可以是具有矩阵运算功能的设备,例如:系统10为计算设备,具体可以是服务器。矩阵运算的加速器100和处理器200具体可以是两个独立的芯片,也可以是集成在一个芯片内的两个模块,本申请对此不做限定。需要说明的是,处理器200例如可以是中央处理器(central processing unit,CPU)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)、专用集成电路(application specific integrated circuit,ASIC)、图形处理器(graphics processing unit,GPU)等。本申请中以处理器200为CPU 200为例进行说明。需要说明的是,共享存储空间300例如可以是主存或者其他任意能够被处理器200和矩阵运算的加速器100共享的存储空间,本申请中以共享存储空间300为主存300为例进行说明。
矩阵运算,是指对至少两个矩阵进行运算得到一个结果矩阵的过程。矩阵运算作为科学 计算领域的核心问题,在大规模科学计算、大规模工程计算以及数值模拟等科学计算中被广泛使用。为了使得科学计算更加高效,矩阵运算通常被优化为高效的、具有良好移植性的线性代数软件包。由于在科学计算领域,矩阵运算主要包括:矩阵乘法、矩阵求幂、矩阵相除等,而大部分的矩阵运算都可以转换为矩阵乘法,所以,矩阵乘法对应的程序可以视作线性代数软件包的核心,例如:基础线性代数库子程序(basic linear algebra subprograms,BLAS)作为一种常用的线性代数软件包,虽然包括了大量已经编写好的关于矩阵运算的程序,但通用矩阵乘(general matrix multiplication,GEMM)对应的程序是该BLAS的核心。
以矩阵A和矩阵B相乘得到矩阵C为例,对矩阵乘法进行介绍。矩阵A和矩阵B可以相乘的条件为:矩阵A包括的列数和矩阵B包括的行数相同。矩阵C中的每个元素,均为矩阵A中的某行中的各个元素和矩阵B中某列中的各个元素进行对应相乘,并将乘积进行累加后得到的,例如:矩阵C中第i行的第j个元素
c_ij = Σ_{k=1}^{N} a_ik × b_kj
其中,N为矩阵A包括的列数,N也为矩阵B包括的行数,a_ik为矩阵A的第i行的第k个元素,b_kj为矩阵B的第k行的第j个元素。下文中将该计算矩阵C中的一个元素的过程简称为一次乘加过程。
值得说明的是,在具体实施过程中可以根据执行矩阵运算两个矩阵的规模确定仅执行矩阵乘操作,还是矩阵相乘后执行乘加操作。
图1中,矩阵运算的加速器100用于接收CPU 200发送的矩阵运算指令,并基于该矩阵运算指令对主存300中保存的待运算的矩阵进行矩阵运算。参见图1,矩阵运算的加速器100包括:控制(control,CTRL)单元110、存储器120、处理单元(process element,PE)131、PE 132、PE 133和PE 134。此外,矩阵运算的加速器100还包括直接内存存取(direct memory access,DMA)单元140。其中,CTRL单元110,用于接收到CPU 200发送的矩阵运算指令,并基于该矩阵运算指令,对待进行矩阵运算的第一矩阵和第二矩阵进行分块操作,基于分块结果向DMA单元140发送指令,指示DMA单元140进行数据存取操作。DMA单元140,用于根据CTRL单元110的指示,从主存300中获取第一矩阵的子集并存入存储器120的第一存储空间。CTRL单元110还用于向各个PE发送运算指令,多个PE用于根据CTRL单元110发送的运算指令,从第一存储空间和第二存储空间分别获取第一矩阵的子集和第二矩阵的子集,对第一矩阵的子集和第二矩阵的子集进行矩阵运算得到第三矩阵的子集,并将第三矩阵的子集存储到第三存储空间对应的位置。如此,当完成对第一矩阵的所有子集和第二矩阵的所有子集的矩阵运算后,DMA单元140还用于从存储器120的第三存储空间读取第三矩阵并存入主存300中。其中,多个PE均和存储器120相连,并且多个PE均受CTRL单元110的控制。
以矩阵A×矩阵B=矩阵C的矩阵乘法为例,假设矩阵A为16×4的矩阵,矩阵B为4×16的矩阵,那么,可以对矩阵A和矩阵B进行分块,具体分块的方式可以为:矩阵A从第0行到第3行的元素组成的矩阵记作子集A0,从第4行到第7行的元素组成的矩阵记作子集A1,从第8行到第11行的元素组成的矩阵记作子集A2,从第12行到第15行的元素组成的矩阵记作子集A3;同理,矩阵B从第0列到第3列的元素组成的矩阵记作子集B0,从第4列到第7列的元素组成的矩阵记作子集B1,从第8列到第11列的元素组成的矩阵记作子集B2,从第12列到第15列的元素组成的矩阵记作子集B3。这样,矩阵A可以划分为4个4×4的子集A0-A3,矩阵B可以划分为4个4×4的子集B0-B3。需要说明的是,划分矩阵得到的每个子集必须包括该矩阵中连续的元素,且矩阵中的任意一个元素仅被包含在一个子集中,矩阵中的所有元素均被包含在子集中。那么,存储器120可以被划分为3个存储空间:存储 空间A、存储空间B和存储空间C,其中,存储空间A中用于保存矩阵A的子集,存储空间B中用于保存矩阵B的子集,存储空间C中用于保存矩阵C,在初始状态下(即还未执行矩阵运算时),矩阵C=0,即,存储空间C为空。存储空间A和存储空间B分别包括4个存储块,存储空间C中包括4×4=16个存储块,每个存储块用于存储矩阵中一个子集(也可以称为区域,每个子集中包括原矩阵中一个区域中所有元素),每个存储块包括4×4=16个存储单元,每个存储单元用于存储矩阵的一个元素。其中,子集为矩阵中部分元素的集合,例如,将矩阵划分为多个方阵。
以PE 131为例,PE 131执行矩阵运算的过程包括:S11,计算C00=A0×B0;S12,计算C01=A0×B1;S13,计算C02=A0×B2;S14,计算C03=A0×B3。
由于每个子集均为4×4的规模,所以,对于S11-S14中的每个步骤,PE 131均要执行16次乘加过程。图2为PE 131执行一次乘加过程所涉及的计算模块的逻辑架构示意图。如图2所示,假设以PE 131执行的乘加过程为:基于A0的第一行{a 00,a 01,a 02,a 03}和B0的第一列{b 00,b 10,b 20,b 30}得到C00第一行的第一个元素的过程,那么,该乘加过程所涉及的计算模块可以包括:乘法器1、乘法器2、乘法器3、乘法器4、加法器1、加法器2、加法器3、加法器4、寄存器1和寄存器2。其中,乘法器1~4的一个输入端分别连接存储器120的存储空间A的第一个存储块中存储a 00、a 01、a 02和a 03对应的存储单元,乘法器1~4的另一个输入端分别连接存储器120的存储空间B的第一个存储块中存储b 00、b 10、b 20和b 30对应的存储单元,乘法器1和乘法器2的输出端连接到加法器1的一个输入端,乘法器3和乘法器4的输出端连接到加法器2的另一个输入端,加法器1和加法器2的输出端均连接加法器3的输入端,加法器3的输出端连接寄存器1的输入端,寄存器1的输出端连接加法器4的一个输入端,加法器4的另一个输入端连接寄存器2的输出端,加法器4的输出端连接寄存器2的输入端和存储器120的存储空间C的第一个存储块中存储子集C00第一行的第一个元素对应的存储单元。其中,乘法器和存储空间之间、乘法器和加法器之间、加法器之间、加法器和寄存器之间、加法器和存储空间之间,均可以通过用于导通电信号的连接线进行连接。
作为一个示例,PE 131执行S11的过程中的一次乘加过程可以包括:
S111,乘法器1分别从存储空间A和存储空间B中读取a00和b00,并将计算a00×b00得到C0,乘法器2分别从存储空间A和存储空间B中读取a01和b10,并将计算a01×b10得到C1,乘法器3分别从存储空间A和存储空间B中读取a02和b20,并将计算a02×b20得到C2,乘法器4分别从存储空间A和存储空间B中读取a03和b30,并将计算a03×b30得到C3;
S112,加法器1计算C0+C1=C12,加法器2计算C2+C3=C23;
S113,加法器3计算C12+C23=C123,并将C123存入寄存器1;
S114,加法器4从寄存器1和寄存器2中分别读取C123和C当前(C当前=0),并计算C123+C当前=C123;
S115,加法器4用C123刷新寄存器2中的C当前,并将C123存入存储空间C的第一个存储块中存储子集C00第一行的第一个元素对应的存储单元。
需要说明的是,PE中,乘法器可以是具有乘法功能的任意电路模块,加法器可以是具有加法功能的任意电路模块,无论是乘法器对应的电路模块还是加法器对应的电路模块,其输入输出端的数量均可以基于需要进行灵活设计。作为一个示例,加法器1~3可以被一个包括四输入一输出的加法器替代。
需要说明的是,上述寄存器1和寄存器2在PE 131中仅起到数据缓存的作用,用以提高乘加过程的处理效率。而实际场景中,一种情况下,PE 131中也可以仅包括寄存器2,那么, 加法器3的输出端直接连接加法器4的输入端即可。另一种情况下,PE 131中也可以不包括寄存器,那么,加法器3的输出端直接连接加法器4的输入端,加法器4的另一个输入端连接存储空间C的第一个存储块中存储子集C00第一行的第一个元素对应的存储单元,从中读取该存储单元当前的数据;或者,PE 131中也可以不包括寄存器和加法器4,那么,加法器3的一个输入端连接存储空间C的第一个存储块中存储子集C00第一行的第一个元素对应的存储单元,从中读取该存储单元当前的数据,加法器3的输出端也连接该存储单元,用累加结果刷新该存储单元当前的数据。
应理解,存储器120,具体可以是易失性存储器或非易失性存储器,其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)或闪存等,易失性存储器可以是随机存取存储器(random access memory,RAM)等,本申请对此不做限定。
需要说明的是,图1所示的系统架构仅仅是为了更好的说明本申请所提供的矩阵运算的方法所提供的系统架构的示例,图2所示的PE 131执行一次乘加过程所涉及的计算模块的逻辑架构仅仅是为了更好的说明本申请所提供的矩阵运算的方法所提供的PE结构示例,并不构成对本申请实施例的限定。
基于上述系统架构,本申请提供一种矩阵运算的方法,处理器向矩阵运算的加速器发送矩阵运算指令,指示该矩阵运算的加速器对第一矩阵和第二矩阵进行矩阵运算,那么,由矩阵运算的加速器对两个矩阵进行分块,得到第一矩阵的多个第一子集和第二矩阵的多个第二子集,并将部分或全部的第一子集和第二子集从主存对应加载到该矩阵运算的加速器中存储器的第一存储空间和第二存储空间中,根据矩阵运算指令对第一子集和第二子集进行矩阵运算,并将第一子集和第二子集对应的矩阵运算的结果保存在存储器的第三存储空间,该第三存储空间中最终的数据即为第一矩阵和第二矩阵进行矩阵运算后得到的结果矩阵。可见,该方法利用专门的矩阵运算的加速器进行矩阵运算,一方面,该矩阵运算的加速器内部具有存储器,使得矩阵运算不再受限于处理器中寄存器的资源,减少矩阵运算的加速器与主存之间存取数据的次数,节约存取数据的时间,提高了矩阵运算的效率;另一方面,由矩阵运算的加速器对参与运算的矩阵进行计算,使得矩阵运算不再受限于处理器的计算能力,能够在较短的时间内完成大规模矩阵的运算,实现高效的矩阵运算。
本申请实施例中,存储器120被划分为若干个存储空间,每个存储空间用于保存矩阵运算中的一个矩阵的全部或部分数据。一个存储空间被划分为若干个存储块,每个存储块用于包括矩阵分块后的一个子集。一个存储块被划分为若干个存储单元,每个存储单元用于存储矩阵的一个元素。
接下来,以图1所示的系统10为例,结合图3详细介绍本申请所提供的矩阵运算的方法,如图3所示,该方法包括:
S301,CPU 200向矩阵运算的加速器100的CTRL单元110发送矩阵运算指令,该矩阵运算指令用于指示对第一矩阵和第二矩阵进行矩阵运算。
具体实现时,S301中的矩阵运算指令具体可以是CPU 200写入主存300的程序空间的程序代码,CTRL单元110从主存300的该程序空间获取该程序代码并进行译码,获得该矩阵运算指令。
矩阵运算指令,用于指示矩阵运算的加速器100进行第一矩阵和第二矩阵之间的矩阵运算。为了使得矩阵运算可以准确的实施,该矩阵运算指令还可以指示参与矩阵运算的矩阵的相关信息,如:每个参与矩阵运算的矩阵的起始地址以及矩阵规模。作为一个示例,矩阵运 算指令具体可以包括:指示信息1、第一矩阵的起始地址1、第一矩阵的规模1、第二矩阵的起始地址2和第二矩阵的规模2,其中,指示信息1用于指示执行第一矩阵和第二矩阵的矩阵乘法,第一矩阵的规模1可以是16×4,第二矩阵的规模2可以是4×16,起始地址1为第一矩阵在主存300的数据空间中存储矩阵A的起始地址,起始地址2为第二矩阵在主存300的数据空间中存储矩阵B的起始地址。
需要说明的是,主存300包括数据空间和程序空间,其中,数据空间用于存储操作数,程序空间用于存储各种指令对应的程序代码。在系统10中,主存300可以为矩阵运算的加速器100预留一部分程序空间,CPU 200可以在该预留的程序空间写入矩阵运算指令对应的程序代码,以指示矩阵运算的加速器100基于该矩阵运算指令进行相应的矩阵运算。
S302,CTRL单元110基于矩阵运算指令,对第一矩阵和第二矩阵进行分块,得到第一矩阵的多个第一子集和第二矩阵的多个第二子集。
CTRL单元110获取到矩阵运算指令后,即可确定需要对第一矩阵和第二矩阵进行矩阵乘法运算。为了使得能够充分利用矩阵运算的加速器100中的资源,实现高效的矩阵运算,CTRL单元110对参与矩阵运算的两个矩阵进行分块处理。分块处理后得到的每个块称为一个子集,每个子集中包括至少一个元素。
对矩阵进行分块处理,具体是将矩阵中至少一个连续行或连续列中的若干元素划分为一个子集。划分矩阵所得的每个子集必须包括该矩阵中连续的元素,矩阵中的任意一个元素仅能够被包含在一个子集中,且矩阵中的所有元素均需要被包含在一个子集。
对矩阵进行划分,还需要确保分块后的第一矩阵的子集和第二矩阵的子集是可乘的,可乘具体指:第一矩阵的子集包括的列数和第二矩阵的子集包括的行数相同。
需要说明的是,矩阵划分得到的子集,可以规模相同也可以不同,只要确保划分后两个矩阵的子集可乘即可。
作为一个示例,对于将矩阵划分为规则相同的若干个矩阵的实施方式,可能存在剩余部分元素无法组成该规模子集的情况,那么,可以将剩余的元素通过补零的方式,继续划分为至少一个该规模的子集,进行矩阵运算的过程不受该补零操作的影响。
本申请实施例中,以将矩阵划分为方阵(即,每个子集为方阵)且两个参与矩阵运算的矩阵划分的子集的规模相同为例进行描述。
例如:第一矩阵的规模为16×4,第二矩阵的规模为4×16,那么,CTRL单元110对第一矩阵和第二矩阵的分块方式可以包括:方式1、子集为1×1的方阵,那么,分块后得到64个第一子集和64个第二子集,每个子集包括1个元素;方式2、子集为2×2的方阵,那么,分块后得到16个第一子集和16个第二子集,每个子集包括连续的4个元素;方式3、子集为4×4的方阵,那么,分块后得到4个第一子集和4个第二子集,每个子集包括连续的16个元素。
S303,CTRL单元110向DMA单元140发送第一命令,第一命令用于指示DMA单元140获取第一矩阵的第一子集和第二矩阵的第二子集。
S304,DMA单元140从主存300获取第一矩阵的第一子集和第二矩阵的第二子集。
S305,DMA单元140将第一矩阵的第一子集和第二矩阵的第二子集分别存入存储器120的第一存储空间和第二存储空间。
CTRL单元110可以基于分块结果,以及存储器120的资源,生成并向DMA单元140发送第一命令,指示该DMA单元140将N个第一子集和N个第二子集从主存300中搬移到存储器120中,其中,N为大于或等于矩阵运算的加速器100所包括的PE的数量的整数,对应该系统10,N≥4。通常,为了读取和运算能够有序进行,N的取值为矩阵运算的加速器100所包括的PE的 数量的整数倍。例如:假设存储器120的资源足够大,能够一次性容纳一个16×4的矩阵、一个4×16的矩阵和一个16×16的矩阵,那么,如果第一子集和第二子集的规模为1×1,则,N可以取4n(n取1~16的整数);如果第一子集和第二子集的规模为2×2,则,N可以取4m(m取1~4的整数);如果第一子集和第二子集的规模为4×4,则,N可以取4。
其中,存储器120将其存储区域划分为多个存储空间,每个存储空间用于存储一个矩阵的数据。例如:对第一矩阵和第二矩阵进行矩阵运算,那么,该存储器120将存储区域划分为3个存储空间:第一存储空间、第二存储空间和第三存储空间,其中,第一存储空间用于存储DMA单元140搬移来的第一矩阵的部分或全部第一子集,第二存储空间用于存储DMA单元140搬移来的第二矩阵的部分或全部第二子集,第三存储空间用于存储各PE进行矩阵运算后得到的中间结果或最终结果(即,第三矩阵),初始状态下(即未进行矩阵运算时),第三存储空间为空。
当DMA单元140接收到第一命令后,即可基于第一命令,从主存300中获取全部或部分第一子集和第二子集,并将所获取的第一子集和第二子集分别存入存储器120的第一存储空间和第二存储空间。例如:参见图4,如果第一矩阵分块后得到A0~A3共4个第一子集,第二矩阵分块后得到B0~B3共4个第二子集,那么,经过S303~S305,存储器120的第一存储空间包括:A0~A3,第二存储空间包括:B0~B3。A0~A3以及B0~B3均为4×4的方阵。
S306,CTRL单元110向各PE发送第二命令,第二命令用于指示各PE进行相应的矩阵运算。
S307,各PE基于第二命令,分别从存储器的第一存储空间和第二存储空间获取第一矩阵的第一子集和第二矩阵的第二子集。
S308,各PE基于第二命令,并行的对所获取的第一子集和第二子集进行矩阵运算得到第三子集,并将第三子集存入存储器120的第三存储空间。
各PE可以基于CTRL单元110发送的第二命令,确定其负责进行矩阵乘法运算的存储块,并对所确定的存储块中的子集进行矩阵乘法运算。需要说明的是,由于各PE执行的矩阵运算可以是并行的,各PE进行并行矩阵运算所实施操作的流程均相同,所以,图3中仅示出PE 131在矩阵运算中的交互流程,以PE 131的矩阵运算操作为例说明该矩阵运算中各PE执行的并行操作。
其中,S308中PE对所获取的第一子集和第二子集进行矩阵运算得到第三子集,并将第三子集存入存储器120的第三存储空间,可以是对第一子集和第二子集进行块乘操作,并将块乘结果作为第一子集和第二子集进行矩阵乘法运算后对应的第三子集,存入第三存储空间中对应的位置。例如:A0和B0分别作为第一子集和第二子集进行矩阵乘法运算后,得到的第三子集C00,并将C00存入第三存储空间的第一个存储块中。需要说明的是,一次块乘操作包括至少一次乘加操作,例如:如果第一子集和第二子集均为4×4的方阵,那么,一次块乘操作包括了4×4=16次的乘加操作。又例如:如果第一子集和第二子集均为2×2的方阵,那么,一次块乘操作包括了2×2=4次的乘加操作。PE进行的乘加操作参见上述图2对应的说明。
举例来说,参见图4,存储器120的三个存储空间均划分为4个存储块,每个存储块存储一个子集,以第一存储空间包括A0-A3,第二存储空间包括B0-B3为例,第一存储空间的存储块0-存储块3分别保存A0-A3,第二存储空间的存储块4-存储块7分别保存B0-B3,第三存储空间的存储块8-存储块23分别保存C00、C01、C02、C03、C10、C11、C12、C13、C20、C21、C22、C23、C30、C31、C32和C33,初始状态下,C00~C33均等于0,即,存储块8~存储块23均为空。
作为一个示例,PE 131对应存储块0和存储块8-存储块11、PE 132对应存储块1和存储块 12~存储块15、PE 133对应存储块2和存储块16-存储块19、PE 134对应存储块3和存储块20~存储块23。
以PE 131为例,由于第二存储空间包括B0~B3,所以需要执行4次块乘操作,每个块乘操作对应第二存储空间的一个存储块。PE 131执行矩阵运算的过程可以包括:
S21,PE 131从存储块0中获取A0,从存储块4中获取B0,计算A0×B0,得到C00,并将C00存入存储器120的第三存储空间的存储块8;
S22,PE 131从存储块5中获取B1,计算A0×B1,得到C01,并将C01存入存储器120的第三存储空间的存储块9;
S23,PE 131从存储块6中获取B2,计算A0×B2,得到C02,并将C02存入存储器120的第三存储空间的存储块10;
S24,PE 131从存储块7中获取B3,计算A0×B3,得到C03,并将C03存入存储器120的第三存储空间的存储块11。其中,S21~S24中的每个步骤代表PE 131执行的一次块乘操作的过程。图4中示出了PE 131执行S21对应的块乘操作的过程,以及PE 132从存储块1中获取A1,从存储块5中获取B1,计算A1×B1,得到C11,并将C11存入存储器120的第三存储空间的存储块13的过程;PE 133从存储块2中获取A2,从存储块6中获取B2,计算A2×B2,得到C22,并将C22存入存储器120的第三存储空间的存储块18的过程;PE 134从存储块3中获取A3,从存储块7中获取B3,计算A3×B3,得到C33,并将C33存入存储器120的第三存储空间的存储块23的过程。
对于S308,各PE进行并行的矩阵运算,具体是指各PE从第一存储空间对应的存储块中获取第一子集后,从第二存储空间的各存储块中依次获取各第二子集,利用第一子集分别和获取的第二子集执行块乘操作后,将得到的第三子集存入该PE在第三存储空间对应的存储块中。其中,每个PE执行块乘操作的次数可以等于S308中参与矩阵运算的第二子集的个数。第三存储空间中保存有若干个第三子集。例如,对于上述举例中矩阵A乘矩阵B的运算,执行S308之后,第三子集的数量等于实施并行矩阵运算的第一子集的个数和第二子集的个数的乘积;又例如,假设对于矩阵B乘矩阵A的运算,执行S308之后,第三子集的数量等于1,每个PE将其计算所得的第三子集累加到第三存储空间当前的数据上,得到最终的矩阵C为4个4×4的方阵。
需要说明的是,各个PE进行并行矩阵运算的过程中,每个PE实施的操作为独立操作,不受其他PE的影响,各个PE进行矩阵运算的速度也不影响其他PE。
通过多个PE对两个矩阵的不同子集进行并行矩阵运算,能够有效的提高矩阵运算的速率。
S309,CTRL单元110判断是否完成对第一矩阵和第二矩阵的矩阵运算,如果否,则,执行下述S310,如果是,执行S311。
S310,CTRL单元110向DMA单元140发送第三命令,第三命令用于指示DMA单元140获取未载入的第一矩阵的第一子集或第二矩阵的第二子集,并返回执行S304。
S311,CTRL单元110通过DMA单元140向主存300写入第三矩阵,该第三矩阵为第一矩阵和第二矩阵进行矩阵运算得到的结果矩阵。
CTRL单元110在每个执行完上述S308之后,均会确认是否还有未参与矩阵运算的第一子集或第二子集,若存在,且确定未完成第一矩阵和第二矩阵的矩阵运算,执行S310,以继续进行未完成的矩阵运算过程;如果确认不存在未参与矩阵运算的第一子集和第二子集,则,确定已经完成第一矩阵和第二矩阵的矩阵运算,可以执行下述S311,以通过DMA单元140 向主存300写入第三矩阵,该第三矩阵为第一矩阵和第二矩阵进行矩阵运算得到的结果矩阵。
在一些可能的实现方式中,当确定未完成第一矩阵和第二矩阵的矩阵运算,则,CTRL单元110向DMA单元140发送第三命令,指示DMA单元140继续从主存300中获取未载入的第一矩阵的第一子集或第二矩阵的第二子集,并返回执行S304-S308,直到完成矩阵运算。
例如:假设第一矩阵的规模为16×8,第二矩阵的规模为8×16,将两个矩阵分块后得到的A0-A7和B0-B7,其中,A0-A3为第一列第一子集,A4-A7为第二列第一子集,B0-B3为第一行第二子集,B4~B7为第二行第二子集。上述S304-S308对第一列第一子集和第一行第二子集进行了并行矩阵运算,得到第三子集C00-C33。经过S309确定还未完成第一矩阵和第二矩阵之间的矩阵运算,则,可以三次执行S310和S304-S308,并将得到的第三子集累加当前第三存储空间对应的存储块上,得到新的第三子集,三次累加之后得到的所有第三子集的集合记作第三矩阵。
以PE 131进行矩阵运算的过程为例,具体可以包括:
S31,DMA单元140基于第三命令,将主存300中的第二子集B4-B7搬移到第二存储空间中,CTRL单元110向各PE发送第二命令,第二命令用于指示各PE进行相应的矩阵运算;
S32,PE 131基于第二命令,从存储器的第二存储空间获取第二矩阵的第二子集B4-B7;
S33,PE 131计算A0×B4得到C00’,并将C00’累加到第三存储空间的存储块8,计算A0×B5得到C01’,并将C01’累加到第三存储空间的存储块9,计算A0×B6得到C02’,并将C02’累加到第三存储空间的存储块10,计算A0×B7得到C03’,并将C03’累加到第三存储空间的存储块11。然后,经过S309的判断,该矩阵运算过程还可以包括:
S34,DMA单元140基于第三命令,将主存300中的第一子集A4~A7搬移到第一存储空间中,CTRL单元110向各PE发送第二命令,第二命令用于指示各PE进行相应的矩阵运算;
S35,PE 131基于第二命令,从存储器的第一存储空间获取第一矩阵的第一子集A4~A7;
S36,PE 131计算A4×B4得到C00”,并将C00”累加到第三存储空间的存储块8,计算A4×B5得到C01”,并将C01”累加到第三存储空间的存储块9,计算A4×B6得到C02”,并将C02”累加到第三存储空间的存储块10,计算A4×B7得到C03”,并将C03”累加到第三存储空间的存储块11。接着,经过S309的判断,该矩阵运算过程还可以包括:
S37,DMA单元140基于第三命令,将主存300中的第二子集B0~B3搬移到第二存储空间中,CTRL单元110向各PE发送第二命令,第二命令用于指示各PE进行相应的矩阵运算;
S38,PE 131基于第二命令,从存储器的第二存储空间获取第二矩阵的第二子集B0~B3;
S39,PE 131计算A4×B0得到C00”’,并将C00”’累加到第三存储空间的存储块8,计算A4×B1得到C01”’,并将C01”’累加到第三存储空间的存储块9,计算A4×B2得到C02”’,并将C02”’累加到第三存储空间的存储块10,计算A4×B3得到C03”’,并将C03”’累加到第三存储空间的存储块11。
如此,PE 131完成了第一矩阵和第二矩阵的矩阵运算,得到第三矩阵的第一行的四个第三子集,四个第三子集分别记作C00、C01、C02和C03,其中,
C00=A0×B0+A0×B4+A4×B4+A4×B0,C01=A0×B1+A0×B5+A4×B5+A4×B1,
C02=A0×B2+A0×B6+A4×B6+A4×B2,C03=A0×B3+A0×B7+A4×B7+A4×B3。
需要说明的是,其他PE执行矩阵运算的过程与上述PE 131的矩阵过程类似,相关描述可以参见上述对PE 131的矩阵运算过程的描述,这里不再赘述。
在另一些可能的实现方式中,当确定已经完成第一矩阵和第二矩阵的矩阵运算,则,CTRL单元110向DMA单元140发送第四命令,指示DMA单元140将得到的第三矩阵存入主存 300中。具体而言,DMA单元140接收到CTRL单元110发送的第四命令后,从存储器120的第三存储空间中获取第三矩阵,并将第三矩阵存入主存300。此外,当确定已经完成第一矩阵和第二矩阵的矩阵运算时,CTRL单元110还可以向CPU 200发送中断指令,该中断指令用于让CPU 200知晓矩阵运算的加速器100已经完成矩阵运算指令所指示的第一矩阵和第二矩阵的运算。
可见,通过本申请实施例提供的方法,由矩阵运算的加速器基于处理器的指示,对主存中的参与运算的矩阵进行分块,得到参与运算的矩阵的多个子集,并将部分或全部的子集从共享存储空间分别加载到该矩阵运算的加速器的存储器的不同存储空间中,根据处理器所发送的矩阵运算指令对不同存储空间的子集进行并行矩阵运算,并将运算所得的结果保存在存储器的另一个存储空间,该另一个存储空间中最终的数据即为第一矩阵和第二矩阵进行矩阵运算后得到的结果矩阵。如此,通过利用专门的矩阵运算的加速器进行矩阵运算,该矩阵运算的加速器内部具有存储器,使得矩阵运算不再受限于处理器中寄存器的资源,减少矩阵运算的加速器与主存之间存取数据的次数,节约存取数据的时间,提高了矩阵运算的效率;而且,由矩阵运算的加速器对参与运算的矩阵进行并行计算,使得矩阵运算不再受限于处理器的计算能力,能够在较短的时间内完成大规模矩阵的运算,实现高效的矩阵运算。
上文中结合图1和图2详细描述了本申请所提供的矩阵运算的加速器,结合图3和图4详细描述了本申请所提供的矩阵运算的方法,下面将结合图5至图6,描述根据本申请所提供的矩阵运算的装置和设备。
图5为本申请提供的一种矩阵运算的装置500,所述矩阵运算的装置500应用于矩阵运算的加速器,所述矩阵运算的装置500包括:接收单元501、存储单元502和运算单元503;
接收单元501,用于接收矩阵运算指令,矩阵运算指令用于指示对第一矩阵和第二矩阵进行矩阵运算;
存储单元502,用于将第一矩阵的子集和第二矩阵的子集分别存储在存储器的第一存储空间和第二存储空间,将第三矩阵存储在存储器的第三存储空间,第三矩阵为基于第一矩阵的子集和第二矩阵的子集相乘后获得的子集组成的矩阵;
运算单元503,用于根据矩阵运算指令对第一矩阵的子集和第二矩阵的子集进行矩阵运算,得到矩阵运算的结果。
可选地,运算单元503,具体用于根据矩阵运算指令对第一矩阵的子集和第二矩阵的子集进行并行矩阵运算,得到矩阵运算的结果。
可选地,该矩阵运算的装置500还可以包括:更新单元;
更新单元,用于基于矩阵运算的结果更新第三存储空间中第三矩阵的子集,其中,第三矩阵的子集为第一矩阵的子集与第二矩阵的子集进行矩阵运算后获得的。
可选地,该矩阵运算的装置500还可以包括:分块单元;
分块单元,用于基于矩阵运算指令,对第一矩阵和第二矩阵进行分块,得到第一矩阵的多个第一子集和第二矩阵的多个第二子集。
可选地,该矩阵运算的装置500还可以包括:数据存取单元;
数据存取单元,用于根据分块的结果,从共享存储空间中获取第一矩阵的N个第一子集和第二矩阵的N个第二子集,N大于或等于矩阵运算的加速器所包括的处理单元PE的数量,N为正整数,共享存储空间为处理器和矩阵运算的加速器共享的存储空间;
那么,上述存储单元502,具体用于:将N个第一子集存入存储器的第一存储空间;将N个第二子集存入存储器的第二存储空间。
可选地,数据存取单元,还用于在完成对第一存储空间中的第一子集和第二存储空间中的第二子集的矩阵运算,且未对第一矩阵和第二矩阵中的所有子集完成矩阵运算时,从共享存储空间中获取第一矩阵未参与矩阵运算的第一子集,并将所获取的第一矩阵未参与矩阵运算的第一子集存入存储器的第一存储空间。
可选地,数据存取单元,还用于在完成对第一存储空间中的第一子集和第二存储空间中的第二子集的矩阵运算,且未对第一矩阵和第二矩阵中的所有子集完成矩阵运算时,从共享存储空间中获取第二矩阵未参与矩阵运算的第二子集,并将所获取的第二矩阵未参与矩阵运算的第二子集存入存储器的第二存储空间。
可选地,数据存取单元,还用于在对第一矩阵和第二矩阵中的所有子集完成矩阵运算时,将第三存储空间中当前保存的第三矩阵从存储器取出,存入共享存储空间中,第三矩阵为第一矩阵和第二矩阵进行矩阵运算得到的矩阵。
可选地,该矩阵运算的装置500还可以包括:发送单元;
发送单元,用于向处理器发送中断指令,中断指令用于告知完成对第一矩阵和第二矩阵的矩阵运算。
可选地,该矩阵运算的装置所应用的矩阵运算的加速器中可以包括:处理单元PE,PE包括乘法器和加法器,其中,乘法器的第一输入端和第二输入端分别连接存储器的第一存储空间和第二存储空间,乘法器的输出端连接加法器的第一输入端,加法器的第二输入端连接存储器的第三存储空间,加法器的输出端连接存储器的第三存储空间。那么,PE中进行矩阵运算的过程可以包括:乘法器对第一矩阵的子集中的元素和第二矩阵的子集中的元素相乘,加法器对多个乘法器的计算结果、第三存储空间中当前第三矩阵的子集中的元素相加,并利用加法运算的结果更新第三存储空间中第三矩阵的子集中的元素。
可选地,该矩阵运算的装置所应用的矩阵运算的加速器中可以包括:处理单元PE,PE包括乘法器、加法器和寄存器,乘法器的第一输入端和第二输入端分别连接存储器的第一存储空间和第二存储空间,乘法器的输出端和寄存器的输出端均连接加法器的输入端,加法器的输出端连接寄存器的输入端,加法器的输出端还连接存储器的第三存储空间。那么,PE中进行矩阵运算的过程可以包括:寄存器存储第三存储空间中当前的第三矩阵到的子集中的元素;乘法器对第一矩阵的子集中的元素和第二矩阵的子集中的元素相乘;加法器对多个乘法器的计算结果、第三存储空间中当前的第三矩阵的子集中的元素对应相加,并利用加法运算的结果更新第三存储空间中第三矩阵的子集中的元素。
可选地,PE中包括的乘法器的数量和第一矩阵的子集的规模、第二矩阵的子集的规模相关。
应理解的是,本申请实施例的装置500可以通过专用集成电路(application-specific integrated circuit,ASIC)实现,或可编程逻辑器件(programmable logic device,PLD)实现,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD),现场可编程门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。也可以通过软件实现图3所示的矩阵运算的方法时,装置500及其各个模块也可以为软件模块。
根据本申请实施例的矩阵运算的装置500可对应于执行本申请实施例中描述的方法,并且矩阵运算的装置500中的各个单元的上述和其它操作和/或功能分别为了实现图3中的方法的相应流程,为了简洁,在此不再赘述。
图6为本申请提供的一种矩阵运算的设备600的示意图,如图所示,所述矩阵运算的设 备600包括处理器601、存储器602、通信接口603和内存单元604。其中,处理器601、存储器602、通信接口603、内存单元604通过总线605进行通信,也可以通过无线传输等其他手段实现通信。该存储器602用于存储指令,该处理器601用于执行该存储器602存储的指令。该存储器602存储程序代码,且处理器601可以调用存储器602中存储的程序代码执行以下操作:
接收矩阵运算指令,所述矩阵运算指令用于指示对第一矩阵和第二矩阵进行矩阵运算;
将所述第一矩阵的子集和所述第二矩阵的子集分别存储在存储器的第一存储空间和第二存储空间,将第三矩阵存储在所述存储器的第三存储空间,所述第三矩阵为基于所述第一矩阵的子集和所述第二矩阵的子集相乘后获得的子集组成的矩阵;
根据所述矩阵运算指令对所述第一矩阵的子集和所述第二矩阵的子集进行矩阵运算,得到矩阵运算的结果。
应理解,在本申请实施例中,该处理器601可以是CPU,该处理器601还可以是其他通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现场可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者是任何常规的处理器等。
该存储器602可以包括只读存储器和随机存取存储器,并向处理器601提供指令和数据。存储器602还可以包括非易失性随机存取存储器。例如,存储器602还可以存储设备类型的信息。
该存储器602可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data date SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
该总线605除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都标为总线605。
应理解,根据本申请实施例的矩阵运算的设备600可对应于本申请实施例中的矩阵运算的装置500,并可以对应于执行根据本申请实施例中图3所示方法中的相应主体,并且矩阵运算的设备600中的各个模块的上述和其它操作和/或功能分别为了实现图3中的各个方法的相应流程,为了简洁,在此不再赘述。
作为另一种可能的实施例,本申请还提供一种设备,该设备包括处理器、共享存储空间和上述如图1中所示的矩阵运算的加速器,处理器和矩阵运算的加速器共享该共享存储空间,其中:处理器,用于向矩阵运算的加速器发送矩阵运算指令;矩阵运算的加速器,用于基于矩阵运算指令,对共享存储空间中的矩阵执行上述如图3所示的方法的各个操作步骤,实现矩阵运算,为了简洁在此不再赘述。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序 产品包括一个或多个计算机指令。在计算机上加载或执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘(solid state drive,SSD)。
以上所述,仅为本申请的具体实施方式。熟悉本技术领域的技术人员根据本申请提供的具体实施方式,可想到变化或替换,都应涵盖在本申请的保护范围之内。

Claims (25)

  1. 一种矩阵运算的加速器,其特征在于,所述加速器包括:控制CTRL单元、存储器和处理单元PE;
    所述CTRL单元,用于接收矩阵运算指令;
    所述存储器,用于在第一存储空间存储第一矩阵的子集、在第二存储空间存储第二矩阵的子集,以及在第三存储空间存储第三矩阵,所述第三矩阵为基于所述第一矩阵的子集和所述第二矩阵的子集相乘后获得的子集组成的矩阵;
    所述PE,用于基于所述矩阵运算指令对所述第一存储空间中所述第一矩阵的子集和所述第二存储空间中所述第二矩阵的子集进行矩阵运算,得到矩阵运算的结果。
  2. 根据权利要求1所述的加速器,其特征在于,所述加速器包括至少一个PE。
  3. 根据权利要求2所述的加速器,其特征在于,当所述至少一个PE为多个PE时,所述多个PE,分别用于基于所述矩阵运算指令对所述第一存储空间中所述第一矩阵的子集和所述第二存储空间中所述第二矩阵的子集进行并行矩阵运算,得到所述矩阵运算的结果。
  4. 根据权利要求1-3任一项所述的加速器,其特征在于,
    所述PE,还用于基于所述矩阵运算的结果更新所述第三存储空间中所述第三矩阵的子集,所述第三矩阵的子集为所述第一矩阵的子集与所述第二矩阵的子集进行矩阵运算后获得的。
  5. 根据权利要求1-4任一项所述的加速器,其特征在于,
    所述CTRL单元,还用于基于所述矩阵运算指令,对所述第一矩阵和所述第二矩阵进行分块,得到所述第一矩阵的多个子集和所述第二矩阵的多个子集。
  6. 根据权利要求5所述的加速器,其特征在于,所述加速器还包括直接内存存取DMA单元;
    所述DMA单元,用于根据所述CTRL单元分块的结果,从共享存储空间中获取所述第一矩阵的N个第一子集和所述第二矩阵的N个第二子集,并将N个所述第一子集和N个所述第二子集分别存入所述存储器的所述第一存储空间和所述第二存储空间,所述N大于或等于所述加速器包括的PE的数量,所述N为正整数,所述共享存储空间为处理器和所述加速器共享的存储空间。
  7. 根据权利要求6所述的加速器,其特征在于,
    所述DMA单元,还用于在所述PE完成对所述第一存储空间中的所述第一子集和所述第二存储空间中的所述第二子集的矩阵运算,且未对所述第一矩阵和所述第二矩阵中的所有子集完成矩阵运算时,从所述共享存储空间中获取所述第一矩阵未参与矩阵运算的第一子集,并将所获取的所述第一矩阵未参与矩阵运算的第一子集存入所述存储器的所述第一存储空间。
  8. 根据权利要求6或7所述的加速器,其特征在于,
    所述DMA单元,还用于在所述PE完成对所述第一存储空间中的所述第一子集和所述第二存储空间中的所述第二子集的矩阵运算,且未对所述第一矩阵和所述第二矩阵中的所有子集完成矩阵运算时,从所述共享存储空间中获取所述第二矩阵未参与矩阵运算的第二子集,并将所获取的所述第二矩阵未参与矩阵运算的第二子集存入所述存储器的所述第二存储空间。
  9. 根据权利要求6所述的加速器,其特征在于,
    所述DMA单元,还用于在所述PE对所述第一矩阵和所述第二矩阵中的所有子集完成矩 阵运算时,将所述第三存储空间中当前保存的所述第三矩阵从所述存储器取出,存入所述共享存储空间中,所述第三矩阵为所述第一矩阵和所述第二矩阵进行矩阵运算得到的矩阵。
  10. 根据权利要求9所述的加速器,其特征在于,
    所述CTRL单元,还用于向处理器发送中断指令,所述中断指令用于告知完成对所述第一矩阵和所述第二矩阵的矩阵运算。
  11. 根据权利要求1-10任一项所述的加速器,其特征在于,所述PE包括:乘法器和加法器,所述乘法器的第一输入端和第二输入端分别连接所述存储器的所述第一存储空间和所述第二存储空间,所述乘法器的输出端连接所述加法器的第一输入端,所述加法器的第二输入端连接所述存储器的所述第三存储空间,所述加法器的输出端连接所述存储器的所述第三存储空间;
    所述乘法器,用于对所述第一矩阵的子集中的元素和所述第二矩阵的子集中的元素相乘;
    所述加法器,用于对多个所述乘法器的计算结果、所述第三存储空间中当前所述第三矩阵的子集中的元素相加,并利用加法运算的结果更新所述第三存储空间中所述第三矩阵的子集中的元素。
  12. 根据权利要求1-10任一项所述的加速器,其特征在于,所述PE包括:乘法器、加法器和寄存器,所述乘法器的第一输入端和第二输入端分别连接所述存储器的所述第一存储空间和所述第二存储空间,所述乘法器的输出端和所述寄存器的输出端均连接所述加法器的输入端,所述加法器的输出端连接所述寄存器的输入端,所述加法器的输出端还连接所述存储器的所述第三存储空间;
    所述寄存器,用于存储所述第三存储空间中当前的所述第三矩阵的子集中的元素;
    所述乘法器,用于对所述第一矩阵的子集中的元素和所述第二矩阵的子集中的元素相乘;
    所述加法器,用于对多个所述乘法器的计算结果、所述寄存器中当前的所述第三矩阵的子集中的元素相加,并利用加法运算的结果更新所述第三存储空间中所述第三矩阵的子集中的元素。
  13. 根据权利要求11或12所述的加速器,其特征在于,所述PE中包括的乘法器的数量和所述第一矩阵的子集的规模、所述第二矩阵的子集的规模相关。
  14. 一种矩阵运算的方法,其特征在于,所述方法应用于矩阵运算的加速器,所述方法包括:
    接收矩阵运算指令,所述矩阵运算指令用于指示对第一矩阵和第二矩阵进行矩阵运算;
    将所述第一矩阵的子集和所述第二矩阵的子集分别存储在存储器的第一存储空间和第二存储空间,将第三矩阵存储在所述存储器的第三存储空间,所述第三矩阵为基于所述第一矩阵的子集和所述第二矩阵的子集相乘后获得的子集组成的矩阵;
    根据所述矩阵运算指令对所述第一矩阵的子集和所述第二矩阵的子集进行矩阵运算,得到矩阵运算的结果。
  15. 根据权利要求14所述的方法,其特征在于,所述根据所述矩阵运算指令对所述第一矩阵的子集和所述第二矩阵的子集进行矩阵运算,包括:
    根据所述矩阵运算指令,对所述第一矩阵的子集和所述第二矩阵的子集进行并行矩阵运算。
  16. 根据权利要求14或15所述的方法,其特征在于,所述方法还包括:
    基于所述矩阵运算的结果更新所述第三存储空间中所述第三矩阵的子集,所述第三矩阵的子集为所述第一矩阵的子集与所述第二矩阵的子集进行矩阵运算后获得的。
  17. 根据权利要求14-16任一项所述的方法,其特征在于,所述方法还包括:
    基于所述矩阵运算指令,对所述第一矩阵和所述第二矩阵进行分块,得到所述第一矩阵的多个第一子集和所述第二矩阵的多个第二子集。
  18. 根据权利要求17所述的方法,其特征在于,所述方法还包括:
    根据分块的结果,从共享存储空间中获取所述第一矩阵的N个所述第一子集和所述第二矩阵的N个所述第二子集,所述N大于或等于所述矩阵运算的加速器所包括的处理单元PE的数量,所述N为正整数,所述共享存储空间为处理器和所述矩阵运算的加速器共享的存储空间;
    所述将所述第一矩阵的子集和所述第二矩阵的子集分别存储在存储器的第一存储空间和第二存储空间,包括:
    将N个所述第一子集存入所述存储器的所述第一存储空间;
    将N个所述第二子集存入所述存储器的所述第二存储空间。
  19. 根据权利要求18所述的方法,其特征在于,在完成对所述第一存储空间中的所述第一子集和所述第二存储空间中的所述第二子集的矩阵运算,且未对所述第一矩阵和所述第二矩阵中的所有子集完成矩阵运算时,所述方法还包括:
    从所述共享存储空间中获取所述第一矩阵未参与矩阵运算的第一子集,并将所获取的所述第一矩阵未参与矩阵运算的第一子集存入所述存储器的所述第一存储空间。
  20. 根据权利要求18或19所述的方法,其特征在于,在完成对所述第一存储空间中的所述第一子集和所述第二存储空间中的所述第二子集的矩阵运算,且未对所述第一矩阵和所述第二矩阵中的所有子集完成矩阵运算时,所述方法还包括:
    从所述共享存储空间中获取所述第二矩阵未参与矩阵运算的第二子集,并将所获取的所述第二矩阵未参与矩阵运算的第二子集存入所述存储器的所述第二存储空间。
  21. 根据权利要求18所述的方法,其特征在于,在对所述第一矩阵和所述第二矩阵中的所有子集完成矩阵运算时,所述方法还包括:
    将所述第三存储空间中当前保存的所述第三矩阵从所述存储器取出,存入所述共享存储空间中,所述第三矩阵为所述第一矩阵和所述第二矩阵进行矩阵运算得到的矩阵。
  22. 根据权利要求21所述的方法,其特征在于,所述方法还包括:
    向处理器发送中断指令,所述中断指令用于告知完成对所述第一矩阵和所述第二矩阵的矩阵运算。
  23. 根据权利要求14-22任一项所述的方法,其特征在于,所述矩阵运算的加速器中包括:处理单元PE,所述PE包括乘法器和加法器,其中,所述乘法器的第一输入端和第二输入端分别连接所述存储器的所述第一存储空间和所述第二存储空间,所述乘法器的输出端连接所述加法器的第一输入端,所述加法器的第二输入端连接所述存储器的所述第三存储空间,所述加法器的输出端连接所述存储器的所述第三存储空间;
    其中,所述乘法器对所述第一矩阵的子集中的元素和所述第二矩阵的子集中的元素相乘,所述加法器对多个所述乘法器的计算结果、所述第三存储空间中当前所述第三矩阵的子集中的元素相加,并利用加法运算的结果更新所述第三存储空间中所述第三矩阵的子集中的元素。
  24. 根据权利要求14-22任一项所述的方法,其特征在于,所述矩阵运算的加速器中包括:处理单元PE,所述PE包括乘法器、加法器和寄存器,所述乘法器的第一输入端和第二输入端分别连接所述存储器的所述第一存储空间和所述第二存储空间,所述乘法器的输出端和所述寄存器的输出端均连接所述加法器的输入端,所述加法器的输出端连接所述寄存器的输入 端,所述加法器的输出端还连接所述存储器的所述第三存储空间;
    其中,所述寄存器存储所述第三存储空间中当前的所述第三矩阵到的子集中的元素;所述乘法器对所述第一矩阵的子集中的元素和所述第二矩阵的子集中的元素相乘;所述加法器对多个所述乘法器的计算结果、所述第三存储空间中当前的所述第三矩阵的子集中的元素对应相加,并利用加法运算的结果更新所述第三存储空间中所述第三矩阵的子集中的元素。
  25. 根据权利要求23或24所述的方法,其特征在于,所述PE中包括的乘法器的数量和所述第一矩阵的子集的规模、所述第二矩阵的子集的规模相关。
PCT/CN2021/099891 2020-07-08 2021-06-12 矩阵运算的方法和加速器 WO2022007597A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21837272.0A EP4180996A4 (en) 2020-07-08 2021-06-12 MATRIX OPERATIONS AND ACCELERATORS
US18/093,929 US20230161835A1 (en) 2020-07-08 2023-01-06 Matrix operation method and accelerator

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010653743.4A CN113918879A (zh) 2020-07-08 2020-07-08 矩阵运算的方法和加速器
CN202010653743.4 2020-07-08

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/093,929 Continuation US20230161835A1 (en) 2020-07-08 2023-01-06 Matrix operation method and accelerator

Publications (1)

Publication Number Publication Date
WO2022007597A1 true WO2022007597A1 (zh) 2022-01-13

Family

ID=79231863

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/099891 WO2022007597A1 (zh) 2020-07-08 2021-06-12 矩阵运算的方法和加速器

Country Status (4)

Country Link
US (1) US20230161835A1 (zh)
EP (1) EP4180996A4 (zh)
CN (1) CN113918879A (zh)
WO (1) WO2022007597A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093816B (zh) * 2023-10-19 2024-01-19 上海登临科技有限公司 矩阵乘运算方法、装置和电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636316A (zh) * 2015-02-06 2015-05-20 中国人民解放军国防科学技术大学 面向gpdsp的大规模矩阵乘法计算的方法
CN104899182A (zh) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 一种支持可变分块的矩阵乘加速方法
CN106445471A (zh) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 处理器和用于在处理器上执行矩阵乘运算的方法
CN109992743A (zh) * 2017-12-29 2019-07-09 华为技术有限公司 矩阵乘法器
US20200201642A1 (en) * 2018-12-20 2020-06-25 Kalray Block-wise matrix multiplication system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275243B2 (en) * 2016-07-02 2019-04-30 Intel Corporation Interruptible and restartable matrix multiplication instructions, processors, methods, and systems
US10824938B2 (en) * 2017-04-24 2020-11-03 Intel Corporation Specialized fixed function hardware for efficient convolution

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636316A (zh) * 2015-02-06 2015-05-20 中国人民解放军国防科学技术大学 面向gpdsp的大规模矩阵乘法计算的方法
CN104899182A (zh) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 一种支持可变分块的矩阵乘加速方法
CN106445471A (zh) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 处理器和用于在处理器上执行矩阵乘运算的方法
CN109992743A (zh) * 2017-12-29 2019-07-09 华为技术有限公司 矩阵乘法器
US20200201642A1 (en) * 2018-12-20 2020-06-25 Kalray Block-wise matrix multiplication system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4180996A4 *

Also Published As

Publication number Publication date
US20230161835A1 (en) 2023-05-25
EP4180996A4 (en) 2024-01-03
EP4180996A1 (en) 2023-05-17
CN113918879A (zh) 2022-01-11

Similar Documents

Publication Publication Date Title
US11720646B2 (en) Operation accelerator
US20180107630A1 (en) Processor and method for executing matrix multiplication operation on processor
CN110096310B (zh) 运算方法、装置、计算机设备和存储介质
WO2019205617A1 (zh) 一种矩阵乘法的计算方法及装置
JP7482636B2 (ja) メモリ装置およびそれを用いたコンピューティング装置
CN115880132B (zh) 图形处理器、矩阵乘法任务处理方法、装置及存储介质
US20170091127A1 (en) Techniques to Couple with a Storage Device via Multiple Communication Ports
US9830731B2 (en) Methods of a graphics-processing unit for tile-based rendering of a display area and graphics-processing apparatus
EP3846036B1 (en) Matrix storage method, matrix access method, apparatus and electronic device
WO2023065983A1 (zh) 计算装置、神经网络处理设备、芯片及处理数据的方法
US11023825B2 (en) Platform as a service cloud server and machine learning data processing method thereof
WO2022007597A1 (zh) 矩阵运算的方法和加速器
CN106227506A (zh) 一种内存压缩系统中的多通道并行压缩解压系统及方法
CN117312330B (zh) 基于便签式存储的向量数据聚集方法、装置及计算机设备
Li et al. Optimized data reuse via reordering for sparse matrix-vector multiplication on fpgas
WO2019223383A1 (zh) 直接内存存取方法、装置、专用计算芯片及异构计算系统
CN115543254A (zh) 一种排序电路、排序方法及电子设备
CN112395008A (zh) 运算方法、装置、计算机设备和存储介质
CN112395009A (zh) 运算方法、装置、计算机设备和存储介质
WO2023115529A1 (zh) 芯片内的数据处理方法及芯片
CN112214443B (zh) 设置于图形处理器中的二次卸载装置和方法
CN111798363B (zh) 图形处理器
CN111382855B (zh) 数据处理装置、方法、芯片及电子设备
JP6115564B2 (ja) データ処理システム、半導体集積回路およびその制御方法
US20230169144A1 (en) Operation method, processor, and related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21837272

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021837272

Country of ref document: EP

Effective date: 20230207