WO2022218374A1 - Method for optimizing matrix multiplication operation of system-on-chip and related products - Google Patents

Method for optimizing matrix multiplication operation of system-on-chip and related products

Info

Publication number
WO2022218374A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
chip
search
present disclosure
splitting
Prior art date
Application number
PCT/CN2022/086815
Other languages
English (en)
French (fr)
Inventor
孙正
李明
俞烨昊
陈支泽
江广
喻歆
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司
Priority to EP22787596.0A (published as EP4325373A1)
Publication of WO2022218374A1 (zh)
Priority to US18/374,817 (published as US20240028666A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models

Definitions

  • the present disclosure relates generally to the field of data computing. More particularly, the present disclosure relates to a method, apparatus, and computer-readable storage medium for optimizing matrix multiply operations of a system-on-chip.
  • The matrix multiplication operation is a very common data operation in the fields of scientific computing and data processing. The currently fast-developing field of artificial intelligence, for example, usually involves a large amount of data computation, including matrix multiplication operations on various types of data.
  • In deep learning, a research hotspot in the field of artificial intelligence, many computing tasks of, for example, Deep Neural Networks ("DNN"), Recurrent Neural Networks ("RNN") and the transformer networks applied at large scale in the field of Natural Language Processing ("NLP") involve large-scale matrix multiplication operations, especially the multiplication of two large matrices. As is well known, the larger the amount of data and the data scale involved in the matrix multiplication operation, the higher the computing power and memory access performance required of the computing platform (especially of the system-on-chip).
  • In existing matrix multiplication operations, a processor such as a central processing unit ("CPU") or a graphics processing unit ("GPU") is usually used to perform the operation. However, limited by the capacity of the processor's internal memory resources, the huge amount of data computation brought about by large-scale matrix multiplication leads to frequent, large-volume data interaction between the processor's on-chip system and external storage devices. Since the bandwidth of the input/output ("I/O") bus between the processor and the external memory is limited, this causes a severe I/O bottleneck, and the resulting data transfer latency greatly reduces the efficiency of parallel computation.
  • In order to at least solve the technical problems mentioned above, the present disclosure provides a solution capable of optimizing the matrix multiplication operation of a system-on-chip. Specifically, the present disclosure proposes an optimal way of determining the matrix splitting in a matrix multiplication operation. By splitting the matrices with the optimal splitting method, the matrix multiplication operation of the present disclosure significantly reduces the amount of data transferred to and from external storage devices, thereby minimizing the I/O bottleneck caused by the bus bandwidth limitation and improving the efficiency of matrix multiplication. In view of this, the present disclosure provides the aforementioned solution in the following aspects.
  • In a first aspect, the present disclosure discloses a method for optimizing a matrix multiplication operation of a system-on-chip, the method being implemented by one or more processors and comprising: receiving matrix information of a first matrix and a second matrix to be split in order to perform a matrix multiplication operation, wherein the first matrix is M rows × K columns and the second matrix is K rows × N columns; and determining, by minimizing a cost function, splitting coefficients for splitting the first matrix and the second matrix, the splitting coefficients including the number of rows M_b and the number of columns K_b of the matrix block obtained after splitting the first matrix and the number of rows K_b and the number of columns N_b of the matrix block obtained after splitting the second matrix, wherein the cost function is used to determine the cost incurred by transferring matrix data between the on-chip system and the off-chip system for performing the matrix multiplication operation on the on-chip system, and wherein the cost function is based on at least the data size of the first matrix, the data size of the second matrix, the number of rows M of the first matrix, the number of columns N of the second matrix, and the splitting coefficients.
  • In another aspect, the present disclosure discloses an apparatus for optimizing a matrix multiplication operation of a system-on-chip, comprising: a processor; and a memory storing program instructions for optimizing the matrix multiplication operation of the system-on-chip, which, when executed by the processor, cause the apparatus to perform the above-described method.
  • the present disclosure discloses a computer-readable storage medium having stored thereon program instructions for optimizing a matrix multiply operation of a system-on-chip, which, when executed by a processor, implement the above-mentioned method.
  • In a further aspect, the present disclosure discloses a system-on-chip for performing a matrix multiplication operation, comprising: a plurality of main processing units, wherein each main processing unit includes a plurality of processing sub-units, each for performing a corresponding matrix multiplication operation; and a plurality of caches for buffering the matrix data on which matrix multiplication operations are to be performed and the results associated with the matrix multiplication operations, wherein the system-on-chip is configured to perform matrix multiplication operations between matrix blocks, and the matrix blocks are obtained by splitting the matrices according to the splitting coefficients of the aforementioned method.
  • the present disclosure discloses an integrated circuit device comprising the aforementioned system-on-chip.
  • the present disclosure discloses a board including the aforementioned integrated circuit device.
  • The solution of the present disclosure constructs a cost function for the cost incurred by transferring matrix block data between the on-chip system and the off-chip system, and selects the optimal splitting coefficients for splitting the matrices with the goal of minimizing the cost function. Therefore, through matrix multiplication performed based on the optimal splitting coefficients, the solution of the present disclosure can make full use of the on-chip resources of the on-chip system and reduce the I/O data interaction with the external memory of the off-chip system, thereby realizing efficient parallel execution of data transfer and matrix multiplication operations.
  • Further, the solution of the present disclosure also reduces the complexity of matrix multiplication operations and supports matrix multiplication operations on very large matrices.
  • the solution of the present disclosure can also select an optimal matrix multiplication algorithm from multiple candidate matrix multiplication algorithms, thereby realizing efficient execution of the matrix multiplication operation.
  • FIG. 1 is a schematic diagram illustrating a matrix splitting operation according to an embodiment of the present disclosure
  • FIG. 2 is a flowchart illustrating a method for optimizing a system-on-chip matrix multiply operation according to an embodiment of the present disclosure
  • FIG. 3 is an architectural diagram illustrating a matrix fetch operation according to an embodiment of the present disclosure
  • FIG. 4 is a schematic architecture diagram of the L2 cache area shown in FIG. 3;
  • FIG. 5 is a schematic architecture diagram of the L1 cache area shown in FIG. 3;
  • FIGS. 6a and 6b are schematic diagrams illustrating the principle of matrix block splitting according to an embodiment of the present disclosure;
  • FIG. 7 is a structural block diagram illustrating a system-on-chip performing a matrix multiplication operation according to an embodiment of the present disclosure
  • FIG. 8 is a schematic diagram illustrating that a computing subunit performs a matrix multiply operation according to an embodiment of the present disclosure
  • FIG. 9 is a flowchart illustrating a method for selecting an optimal matrix multiplication algorithm according to an embodiment of the present disclosure.
  • FIG. 10 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • The inventors of the present disclosure have found that when two matrices are split in an arbitrary way for matrix multiplication, the splitting does not significantly change the total computational load of multiplications and additions in the matrix multiplication operation. However, the splitting can significantly change the amount of I/O between the on-chip system and the off-chip system. Therefore, optimizing the amount of I/O between the on-chip and off-chip systems becomes the key to improving the performance of matrix multiplication operations. In view of this, in order to improve the data access performance between the on-chip and off-chip systems in the matrix multiplication operation, to improve the efficiency of the matrix multiplication operation, and to reduce its cost, the present disclosure proposes a scheme for optimizing the matrix multiplication operation, which involves determining the splitting coefficients for splitting matrices of relatively large size.
  • Regarding the matrix splitting operation, when two matrices of relatively large size are multiplied, each large matrix can be split into blocks, each block (i.e., a "matrix block" in the context of this disclosure) is treated as a new element of the split matrix, and the matrix multiplication operation is performed based on these new elements.
  • the general matrix multiplication problem can be transformed into a block matrix multiplication problem.
  • In this way, the multiplication calculation between large matrices becomes clearer and more explicit, which greatly simplifies the calculation.
  • block matrix multiplication is also an important means for the system-on-chip to solve the general matrix multiplication problem.
  • For example, the system-on-chip may perform only the multiplication of two split matrix blocks at a time, so that the matrix multiplication operation can be adapted to its limited storage resources and computing resources.
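  • To make the foregoing block-multiplication idea concrete, the following is a minimal, plain-Python sketch (an illustration added for clarity, not taken from the original text) of a general matrix multiplication performed block by block; for simplicity it assumes that the matrix dimensions are exact multiples of the block sizes:

      def block_matmul(A, B, Mb, Kb, Nb):
          """Compute C = A x B by iterating over (Mb x Kb) blocks of A and
          (Kb x Nb) blocks of B; A is M x K and B is K x N, as lists of lists."""
          M, K, N = len(A), len(A[0]), len(B[0])
          C = [[0] * N for _ in range(M)]
          for i0 in range(0, M, Mb):
              for j0 in range(0, N, Nb):
                  for k0 in range(0, K, Kb):
                      # one block-level multiply-accumulate: C_blk += A_blk x B_blk
                      for i in range(i0, i0 + Mb):
                          for j in range(j0, j0 + Nb):
                              C[i][j] += sum(A[i][k] * B[k][j]
                                             for k in range(k0, k0 + Kb))
          return C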
  • the foregoing matrix splitting operation is described below by taking FIG. 1 as an example.
  • FIG. 1 is a schematic diagram illustrating an operation of splitting a matrix according to an embodiment of the present disclosure, wherein the upper part shows the matrix before splitting, and the lower part shows the matrix after splitting as shown by the arrow in the figure.
  • FIG. 1 shows that the A matrix (ie, the "first matrix” of the present disclosure) and the B matrix (ie, the "second matrix” of the present disclosure) perform a matrix multiplication operation, thereby obtaining the C matrix as the result matrix.
  • As shown in the figure, the A matrix and the B matrix shown in the upper part of FIG. 1 can be subjected to a block-splitting operation, wherein a dashed box in the A matrix and the B matrix represents an exemplary matrix block.
  • Through the splitting, the A matrix block A_11 (consisting of elements such as a_11, a_12, a_21 and a_22) and the B matrix block B_11 (consisting of elements such as b_11, b_12, b_21 and b_22) shown in the lower part of the figure can be obtained, where each matrix block serves as a new element in the split matrix.
  • For ease of description, each matrix block after the A matrix, the B matrix and the C matrix are split is denoted as: A block (of size M_b rows × K_b columns), B block (of size K_b rows × N_b columns), and C block (of size M_b rows × N_b columns).
  • the present disclosure proposes a solution for determining the optimal matrix block, so as to realize optimal splitting (or partitioning) of the matrix.
  • Through the solution of the present disclosure, the aforementioned M_b, K_b and N_b can be determined.
  • Then, the first and second matrices are split using the determined M_b, K_b and N_b (i.e., the splitting coefficients in the context of this disclosure).
  • the solution of the present disclosure can simplify the matrix multiplication operation, reduce the I/O bottleneck problem caused by the bandwidth limitation of the on-chip system to the greatest extent, and further improve the operation efficiency of the matrix multiplication.
  • FIG. 2 is a flowchart illustrating a method 200 for optimizing a system-on-chip matrix multiply operation in accordance with an embodiment of the present disclosure.
  • a system-on-a-chip usually integrates a complete system on a single chip.
  • the system can generally include various modules such as a system-on-chip control logic module, a microprocessor/microcontroller CPU core module, an embedded memory module, and an interface module that communicates with the off-chip system.
  • the system-on-chip described herein may be a system-on-chip supporting matrix multiplication operations, including a plurality of main computing units to perform matrix multiplication operations and memory for storing matrix data and matrix multiplication results.
  • In some embodiments, the aforementioned multiple main computing units may be connected in sequence to form a data transfer loop, and each main computing unit may include multiple computing sub-units, so that matrix splitting at the main-computing-unit level and secondary matrix splitting at the computing-sub-unit level, that is, a multi-level matrix splitting operation, can be implemented.
  • As shown in FIG. 2, the method 200 of the present disclosure for optimizing a matrix multiplication operation of a system-on-chip includes, at step S202, receiving matrix information of a first matrix (e.g., the A matrix in FIG. 1) and a second matrix (e.g., the B matrix in FIG. 1) to be split in order to perform a matrix multiplication operation.
  • the aforementioned matrix information may include data size and data information of the matrix.
  • For example, the matrix information may indicate that the first matrix is a large matrix of M rows × K columns and that the second matrix is a large matrix of K rows × N columns.
  • the matrix information may also include the data size (eg, in bits or bytes) of each element in the first matrix and/or the second matrix.
  • Due to the need for a splitting operation, M, K and N here can represent relatively large positive integers, such as 256, 512, 1024 or 2048, so that the solution of the present disclosure can be applied to the splitting of large matrices and to the matrix multiplication operations performed on them.
  • a splitting coefficient for splitting the first matrix and the second matrix is determined by minimizing the cost function.
  • The aforementioned splitting coefficients may include the size of the matrix block ("block") obtained after splitting, such as the number of rows M_b and the number of columns K_b of the matrix block obtained after splitting the first matrix, and the number of rows K_b and the number of columns N_b of the matrix block obtained after splitting the second matrix.
  • The cost function described above is used to determine (or measure) the cost incurred by transferring matrix data between the on-chip system and the off-chip system (as shown in FIG. 3) for performing the matrix multiplication operation on the on-chip system.
  • the system-on-chip may be arranged with on-chip caches 304, 306 for storing matrix blocks.
  • the off-chip system may be provided with a global memory 302, which may communicate various types of data including matrix blocks with the on-chip cache through an I/O interface.
  • The global memory here may be a dynamic random access memory ("DRAM"), such as a double data rate synchronous dynamic random access memory ("DDR").
  • In view of this, the present disclosure proposes to construct the expression of its cost function based on at least the data size of the first matrix, the data size of the second matrix, the number of rows M of the first matrix, the number of columns N of the second matrix, and the splitting coefficients to be determined in the disclosed scheme (that is, M_b, K_b and N_b).
  • In one embodiment, the cost function cost of the present disclosure can be expressed in the following form:

        cost = ⌈N/N_b⌉ × A_size + ⌈M/M_b⌉ × B_size        (1)

  • where A_size and B_size represent the total data sizes of the A matrix (that is, the first matrix of the present disclosure) and the B matrix (that is, the "second matrix" of the present disclosure), respectively, and ⌈ ⌉ indicates a round-up (ceiling) operation. Substituting A_size = M × K × w and B_size = K × N × w, where w denotes the data size of a single matrix element, formula (1) can be expressed as:

        cost = ⌈N/N_b⌉ × M × K × w + ⌈M/M_b⌉ × K × N × w        (2)

  • In one embodiment, the cost function cost of formula (2) can be further expressed in the following form:

        cost = (⌈N/N_b⌉ × M + ⌈M/M_b⌉ × N) × K × w        (3)

  • Since the common factor K × w does not affect the minimization, in one embodiment the cost function cost can also be expressed in the following form (that is, the K term in formula (3) is omitted):

        cost = ⌈N/N_b⌉ × M + ⌈M/M_b⌉ × N        (4)
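  • As a concrete illustration of the formulas above, the following is a minimal Python sketch (added for illustration; the function name and argument layout are not from the original text) of the simplified cost function of formula (4):

      import math

      def split_cost(M, N, Mb, Nb):
          """Simplified transfer cost of formula (4): the A matrix is reloaded
          once per column strip of B, and the B matrix once per row strip of A."""
          return math.ceil(N / Nb) * M + math.ceil(M / Mb) * N

      # Example: for M = N = 1024, blocks with Mb = 128 and Nb = 256 give
      # ceil(1024/256) * 1024 + ceil(1024/128) * 1024 = 4096 + 8192 = 12288.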
  • In order to estimate the above cost more accurately, the solution of the present disclosure further proposes to incorporate a bandwidth utilization coefficient into the cost function. Specifically, a bandwidth utilization coefficient β(L) can be added to the cost function expressed by the aforementioned formulas (1)-(4), which is equal to the ratio between the equivalent bandwidth when a matrix block is loaded segment by segment according to a predetermined data length "L" (e.g., every L elements in the matrix block) and the full bandwidth ("full bandwidth"). Here, the equivalent bandwidth is the reciprocal of the time it takes to load a matrix block segment by segment according to the given data length, while the full bandwidth refers to the total bandwidth of data transmission between the on-chip system and the off-chip system, which is approximately the bandwidth achieved by a single continuous load.
  • By taking the bandwidth utilization coefficient into account, the cost function in formula (1) can be further expressed, for example, as the following formula:

        cost = ⌈N/N_b⌉ × A_size / β(lda_b) + ⌈M/M_b⌉ × B_size / β(ldb_b)        (5)
  • where lda and ldb represent the leading dimension ("leading dimension", abbreviated "ld") of the A matrix and the B matrix, respectively; the leading dimension refers to the row width or the column width of a matrix when it is stored on the off-chip system in row-major or column-major order.
  • lda_b represents the leading dimension of the matrix block obtained after the A matrix is split, and ldb_b represents the leading dimension of the matrix block obtained after the B matrix is split.
  • When a matrix is stored in row-major order, its leading dimension is its row width (i.e., the number of column elements); when a matrix is stored in column-major order, its leading dimension is its column width (i.e., the number of row elements).
  • Thereby, lda_b and ldb_b respectively represent the splitting granularity of the A matrix and the B matrix in the leading dimension, that is, the number of elements of a matrix block along the leading dimension after splitting.
  • Correspondingly, β(lda_b) in formula (5) represents the ratio between the equivalent bandwidth of lda_b and the full bandwidth ("full bandwidth"), where the equivalent bandwidth of lda_b is the reciprocal of the time it takes to load one matrix block piece by piece according to that data length (e.g., K_b), and similarly for β(ldb_b).
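  • The bandwidth-aware cost of formula (5) can be sketched in the same spirit; here beta is a hypothetical callable (an assumption of this sketch) that returns the measured ratio of the equivalent bandwidth to the full bandwidth for a given leading-dimension granularity:

      import math

      def split_cost_bw(M, N, Mb, Nb, A_size, B_size, lda_b, ldb_b, beta):
          """Formula (5) sketch: each matrix's transfer volume is inflated by
          the inverse of its bandwidth utilization coefficient beta(ld)."""
          return (math.ceil(N / Nb) * A_size / beta(lda_b)
                  + math.ceil(M / Mb) * B_size / beta(ldb_b))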
  • In one embodiment, the method 200 may further include establishing a search space for minimizing the cost function, so as to determine the splitting coefficients by means of the search space.
  • establishing a search space for minimizing the cost function may include partitioning a cache (or cache memory) of the system-on-chip, and building the search space according to a result of the partitioning.
  • Here, the aforementioned cache is arranged to store the split matrix blocks and the matrix multiplication results obtained by performing the matrix multiplication operation on the split matrix blocks.
  • As shown in FIG. 3, the architecture of the present disclosure includes an on-chip system and an off-chip system.
  • In terms of data interaction, the DRAM may exchange data with the L2 cache area 304 ("cache") through the DDR interface, for example, splitting the matrix blocks on which the matrix multiplication operation is to be performed into matrix sub-blocks and loading them into the L2 cache area 304.
  • the L2 cache area 304 may be a system-on-chip shared memory ("Shared Memory”), which may be shared by multiple main computing units.
  • Next, the L2 cache area 304 may perform data transfer with a plurality of L1 cache areas 306, so that the atomic matrices obtained after the matrix sub-blocks are split again are transferred to the L1 cache areas 306 accordingly.
  • In the context of the present disclosure, an atomic matrix may be regarded as the smallest matrix unit of a matrix multiplication operation supported by a computing subunit.
  • A compute core ("Core") 308, i.e., the aforementioned computing subunit, reads the atomic matrices from its L1 cache area 306, which can be regarded as a private storage area of each computing core 308.
  • a plurality of computing sub-units may form a main computing unit, for example, the four computing cores 308 in FIG. 3 may constitute a main computing unit of the present disclosure.
  • the system-on-chip of the present disclosure may include multiple levels of cache areas.
  • the L2 cache area 304 shown in FIG. 3 may be considered a first level cache area
  • the L1 cache area 306 may be considered a second level cache area.
  • the method of the present disclosure may include establishing a search subspace associated with each level of cache area according to a predetermined matrix multiplication algorithm for performing a matrix multiplication operation.
  • the corresponding first search subspace and second search subspace may be established according to the first level cache area (eg, L2 cache area 304 ) and the second level cache area (eg, L1 cache area 306 ) .
  • In one embodiment, the method 200 of the present disclosure may further include: establishing the first search subspace according to the settings of a plurality of first caches in the first-level cache area, wherein the plurality of first caches are used to store the matrix sub-blocks obtained by splitting the matrix blocks; and establishing the second search subspace according to the settings of a plurality of second caches in the second-level cache area, wherein the plurality of second caches are used to store the atomic matrices obtained by splitting the matrix sub-blocks again and the intermediate operation results obtained by performing the matrix multiplication operation on them.
  • For convenience of description, it is assumed below that the system-on-chip includes two levels of cache areas, namely the L2 cache area and the L1 cache area (i.e., the L2 cache area and the L1 cache area shown in FIG. 3), and that the Cannon ("cannon") algorithm is used to accelerate the matrix multiplication operation.
  • In view of this, three independent high-speed buffers ("buffers"), namely buffer1, buffer2 and buffer3 shown in FIG. 4, may be set up on the L2 cache area 304 for the A matrix and for the B matrix, respectively. Among them, buffer1 can be used to receive data sent by other main computing units; buffer2 can load matrix data from the global memory (such as the DRAM shown in FIG. 3); and buffer3 is provided to the main computing unit for use, with its data passed to the L1 cache area for real-time computation by the computing sub-units and the intermediate results saved in the L1 cache area.
  • Assuming that, under the Cannon algorithm, each matrix block is divided into P_1 parts in each of the M, K and N dimensions, the restriction condition on the L2 cache area can be expressed by the following formula (6), where Space_L2 represents the storage capacity of the L2 cache area 304 and w represents the data size of a single matrix element:

        3 × (M_b/P_1) × (K_b/P_1) × w + 3 × (K_b/P_1) × (N_b/P_1) × w ≤ Space_L2        (6)

  • The above formula (6) defines the aforementioned first search subspace, and the present disclosure searches for suitable M_b, K_b and N_b under the condition that the inequality of formula (6) is satisfied.
  • It is to be understood that P_1 is related to the configuration of the main computing units of the system-on-chip. For example, when the system-on-chip includes 4 main computing units, "P_1" takes the value of 2, that is, each matrix block is divided into 2 parts in each of the M, K and N dimensions, so that a matrix block is split into four matrix sub-blocks. Similarly, when the system-on-chip includes 9 main computing units, P_1 takes the value of 3, that is, each matrix block is divided into 3 parts in each of the M, K and N dimensions, so that a matrix block is split into nine matrix sub-blocks.
  • More generally, for any matrix multiplication algorithm, the present disclosure proposes that S_10, S_11 and S_13 independent buffers may be set up on the L2 cache area for the A block of the first matrix, the B block of the second matrix, and the C block of the result matrix, respectively, and that the matrix block is divided into P_10, P_11 and P_12 parts in the M, K and N dimensions, respectively, according to the matrix multiplication algorithm. The restriction condition on the L2 cache area (that is, the first search subspace of the present disclosure) can then be expressed by the following formula (7):

        S_10 × (M_b/P_10) × (K_b/P_11) × w + S_11 × (K_b/P_11) × (N_b/P_12) × w + S_13 × (M_b/P_10) × (N_b/P_12) × w ≤ Space_L2        (7)

  • It can be seen that formula (6) can be regarded as a special case of formula (7).
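  • A sketch of the L2 capacity check of formula (7) follows (added for illustration; the default buffer counts S and split factors P are assumptions, not values mandated by the original text):

      def fits_l2(Mb, Kb, Nb, w, space_l2,
                  S10=3, S11=3, S13=1, P10=2, P11=2, P12=2):
          """Check the L2 restriction of formula (7); w is the size of one
          matrix element in bytes and space_l2 the L2 capacity in bytes."""
          a = S10 * (Mb // P10) * (Kb // P11) * w   # A-matrix sub-block buffers
          b = S11 * (Kb // P11) * (Nb // P12) * w   # B-matrix sub-block buffers
          c = S13 * (Mb // P10) * (Nb // P12) * w   # C (result) sub-block buffer
          return a + b + c <= space_l2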
  • the present disclosure sets a plurality of buffers on the L1 cache area according to the Cannon algorithm to determine the second search subspace of the present disclosure.
  • Specifically, the solution of the present disclosure proposes that, on the L1 cache area, two independent buffers, namely buffer1 and buffer2 shown in FIG. 5, may be set up to support the pipelined execution of the matrix multiplication operations on the atomic matrices obtained by splitting the matrix sub-blocks again. In other words, buffer1 and buffer2 can alternately perform the operations of receiving atomic matrices and participating in computation.
  • In addition, an independent buffer, namely buffer3 shown in FIG. 5, is set up for the C matrix residing on the L1 cache area, for storing the intermediate results obtained by performing matrix multiplication operations between atomic matrices.
  • In one embodiment, each of the aforementioned matrix sub-blocks is further divided into P_0 parts in each of the M, K and N dimensions to obtain the atomic matrices of the present disclosure.
  • On this basis, the restriction condition on the L1 cache area can be expressed by the following formula (8), where Space_L1 represents the storage capacity of the L1 cache area:

        2 × (M_b/(P_1×P_0)) × (K_b/(P_1×P_0)) × w + 2 × (K_b/(P_1×P_0)) × (N_b/(P_1×P_0)) × w + (M_b/(P_1×P_0)) × (N_b/(P_1×P_0)) × w ≤ Space_L1        (8)

  • The above formula (8) defines the aforementioned second search subspace, and the present disclosure searches for suitable M_b, K_b and N_b under the condition that the inequality of formula (8) is satisfied.
  • It is to be understood that P_0 is related to the configuration of the computing sub-units of the system-on-chip. For example, when each main computing unit of the system-on-chip includes 4 computing sub-units, "P_0" takes the value of 2, that is, each matrix sub-block is divided into 2 parts in each of the M, K and N dimensions, thereby splitting a matrix sub-block into four atomic matrices. Similarly, when each main computing unit includes 9 computing sub-units, P_0 takes the value of 3, that is, each matrix sub-block is divided into 3 parts in each of the M, K and N dimensions, thereby splitting a matrix sub-block into nine atomic matrices.
  • More generally, for any matrix multiplication algorithm, the present disclosure proposes that S_00, S_01 and S_03 independent buffers may be set up on the L1 cache area for the A block of the first matrix, the B block of the second matrix, and the C block of the result matrix, respectively, and that, on the basis of the L2 cache layout, the aforementioned matrix sub-blocks are divided into P_00, P_01 and P_02 parts in the M, K and N dimensions, respectively, according to the algorithm. The restriction condition on the L1 cache area (that is, the second search subspace of the present disclosure) can then be expressed by the following formula (9):

        S_00 × (M_b/(P_10×P_00)) × (K_b/(P_11×P_01)) × w + S_01 × (K_b/(P_11×P_01)) × (N_b/(P_12×P_02)) × w + S_03 × (M_b/(P_10×P_00)) × (N_b/(P_12×P_02)) × w ≤ Space_L1        (9)

  • Correspondingly, formula (8) can be regarded as a special case of formula (9).
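  • The corresponding L1 capacity check of formula (9) can be sketched analogously (again, all default values are illustrative assumptions):

      def fits_l1(Mb, Kb, Nb, w, space_l1,
                  S00=2, S01=2, S03=1, P1=(2, 2, 2), P0=(2, 2, 2)):
          """Check the L1 restriction of formula (9): the atomic-matrix buffers
          for A and B plus the intermediate-result buffer must fit in L1."""
          m = Mb // (P1[0] * P0[0])   # atomic-matrix row count
          k = Kb // (P1[1] * P0[1])   # atomic-matrix inner dimension
          n = Nb // (P1[2] * P0[2])   # atomic-matrix column count
          return (S00 * m * k + S01 * k * n + S03 * m * n) * w <= space_l1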
  • In one embodiment, determining the splitting coefficients by minimizing the cost function may include determining a search step size for searching the search space, wherein the search step size includes the search steps Δm, Δk and Δn associated with the M, K and N dimensions, respectively. Further, a search algorithm may be used to search the search space with the search step size to determine the M_b, K_b and N_b that minimize the cost function.
  • the above-mentioned search step size of the present disclosure may be determined by considering the following two factors:
  • First, according to the computing capability of the computing subunit, the tiling size ("tiling size") for splitting the matrices on which the matrix multiplication operation is to be performed is determined. Here, the aforementioned computing capability relates to the matrix size that the computing subunit can support in the matrix multiplication operation, that is, the size of the atomic matrix of the present disclosure. In other words, the tiling size here is the size required by the computing subunit for its matrix multiplication operation after the first matrix or the second matrix of the present disclosure has undergone the second-level split, and the atomic matrices obtained after the second-level split can be stored in the aforementioned L1 cache area.
  • Depending on the hardware, the tiling sizes supported by different computing sub-units can differ. For example, the tiling size can be (8 × 8), (16 × 16) or ((16*V) × (16*V)), where "*" represents a multiplication sign and V is a positive integer greater than 1. Further, when the tiling size supported by the computing subunit is (16 × 16) and the storage space of the L1 cache area is large, the L1 cache area can also store a matrix of size ((16*V) × (16*Q)), where Q is a positive integer greater than 1. In this case, the matrix multiplication operation can be performed by reading a matrix of size (16 × 16) from the aforementioned L1 cache area each time.
  • Second, in order to improve the efficiency of the operation, the K dimension and the N dimension of the A matrix and the B matrix are aligned to relatively small positive integers e_K and e_N, respectively; that is, the size of the matrix computable on the computing subunit is an integer multiple of e_K and e_N in the K and N dimensions, respectively. Here, e_K and e_N may, for example, take values such as 4, 8, 16, 32 or 64 according to the computing capability of the computing subunit. Similarly, when an alignment value e_M is also applied in the M dimension, the size of the matrix computable on the computing subunit is an integer multiple of e_M, e_K and e_N in the M, K and N dimensions, respectively.
  • the search steps ⁇ m, ⁇ k and ⁇ n in each row and column dimension of M, K and N satisfy the following, etc. formula condition, wherein P 12 , P 11 , P 10 , P 00 , P 01 and P 02 have the same meanings as in formula (9):
  • n', k', and m' in the above formulas (10) to (15) are any positive integers.
  • the A matrix The split granularity of the and B matrix in the main dimension can be in "L" units.
  • matrix A has no transpose and matrix B has transpose:
  • scm(a,b) represents the least common multiple of a and b.
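  • Under these conditions, the smallest admissible search steps can be computed as in the following sketch (the handling of the transpose case generalizes the single example given for formula (13) and is therefore an assumption):

      from math import lcm

      def search_steps(eM, eK, eN, L, P10, P11, P12, P00, P01, P02,
                       a_transposed=False, b_transposed=True):
          """Smallest search steps satisfying formulas (10)-(15) with
          m' = k' = n' = 1; the K-dimension step is aligned to scm(eK, L) in
          the no-transpose-A / transposed-B case shown for formula (13)."""
          dm = P10 * P00 * eM
          dk = P11 * P01 * (lcm(eK, L)
                            if (not a_transposed and b_transposed) else eK)
          dn = P12 * P02 * eN
          return dm, dk, dn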
  • In one embodiment, the matrix information of the present disclosure further includes the number of main computing units that can participate in the matrix multiplication operation (which can be used, for example, to determine the value of the above-mentioned "P_1"), the number of computing sub-units within each of the main computing units (which can be used, for example, to determine the value of the above-mentioned "P_0"), and the data size that achieves the highest bandwidth utilization rate when loading from the off-chip system (e.g., "DDR"), such as "L" as previously used in determining the search step size.
  • Accordingly, the method of the present disclosure may include determining the search step size according to at least the number of main computing units, the number of computing sub-units, and the aforementioned data size.
  • In some scenarios, the determination of the search step size also needs to take into account the storage format of the matrices and whether they are transposed. Therefore, the matrix information of the present disclosure may also include the storage format of the first matrix and the second matrix in the off-chip system and transposition information indicating whether they have undergone a transposition operation, wherein the storage format includes storage in row-major order or column-major order as described above.
  • In view of this, the method of the present disclosure further includes: determining the search step size, that is, the above search steps Δm, Δk and Δn associated with the M, K and N dimensions, respectively, according to the storage format and transposition information of the first matrix and the second matrix.
  • After the search space and the search step size are determined, the method of the present disclosure can use a suitable search algorithm to find the optimal splitting coefficients M_b, K_b and N_b.
  • the search algorithm may include, but is not limited to, various optimization algorithms such as global search, neighborhood search, and genetic algorithm.
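  • By way of illustration, an exhaustive (global) search may be organized as in the following Python sketch, where in_U1 and in_U2 are hypothetical membership tests for the constraint sets U1 and U2 discussed below:

      def search_split(M, K, N, dm, dk, dn, in_U1, in_U2, cost_fn):
          """Grid-search (Mb, Kb, Nb) over the search space with steps
          (dm, dk, dn), keeping only candidates inside both U1 and U2 and
          returning the candidate that minimizes the cost function."""
          best, best_cost = None, float("inf")
          for Mb in range(dm, M + 1, dm):
              for Kb in range(dk, K + 1, dk):
                  for Nb in range(dn, N + 1, dn):
                      if in_U1(Mb, Kb, Nb) and in_U2(Mb, Kb, Nb):
                          c = cost_fn(Mb, Kb, Nb)
                          if c < best_cost:
                              best, best_cost = (Mb, Kb, Nb), c
          return best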
  • U1 in the exemplary pseudocode above is the set satisfying the L1 cache constraint (i.e., the second search subspace in the context of this disclosure, as shown in formula (8)), and U2 is the set satisfying the L2 cache constraint (i.e., the first search subspace in the context of this disclosure, as shown in formula (6)).
  • FIGS. 6a and 6b are schematic diagrams respectively illustrating the principle of matrix block splitting according to an embodiment of the present disclosure.
  • Through the aforementioned determination of the splitting coefficients, it can be determined that a matrix block obtained by splitting the first matrix of the present disclosure (as shown in the figure) has a size of (M_b × K_b), and that a matrix block obtained after the second matrix is split (as shown in the figure) has a size of (K_b × N_b), so that a corresponding matrix block of the result matrix (as shown in the figure) has a size of (M_b × N_b).
  • Taking the Cannon algorithm as an example, the matrix block obtained by splitting the first matrix is split again into matrix sub-blocks, and the matrix block obtained by splitting the second matrix is likewise split again into matrix sub-blocks (the specific sub-blocks are those shown in FIG. 6a).
  • Further, the above-mentioned matrix sub-blocks can be split once more, so that the matrix multiplication operation of two matrix sub-blocks is converted into matrix multiplication operations of four atomic matrices.
  • As shown in FIG. 6b, after the further splitting, each matrix sub-block shown in FIG. 6a can be split into four atomic matrices; the same splitting applies to the remaining matrix sub-blocks and will not be repeated here.
  • FIG. 7 is a block diagram illustrating a structure of a system-on-chip according to an embodiment of the present disclosure.
  • the system-on-a-chip includes a plurality of main computing units, such as main computing unit 1 to main computing unit 4 shown in the figure.
  • the L2 cache area is shared by the aforementioned multiple main computing units.
  • In terms of storage, the L2 cache area is provided with a high-speed buffer for loading matrix data from the off-chip system (such as the DDR schematically shown in FIG. 7), a high-speed buffer for passing data between the main computing units, and a high-speed buffer for the matrix multiplication operation.
  • For exemplary purposes, the figure takes the Cannon algorithm for matrix multiplication as an example and shows the matrix sub-blocks loaded at each main computing unit: for example, main computing unit 1 loads its A and B matrix sub-blocks from the DDR via the L2 cache area, main computing unit 2 likewise loads its corresponding matrix sub-blocks from the DDR via the L2 cache area, and so on, in order to perform the first round of matrix multiplication.
  • After that, each main computing unit may also receive matrix sub-blocks from its adjacent main computing units to further perform its matrix multiplication operation, thereby obtaining its corresponding intermediate-result matrix sub-block.
  • Taking main computing unit 1 as an example, it can receive one matrix sub-block from main computing unit 2 and another from main computing unit 4 in order to perform the second round of matrix multiplication according to the Cannon algorithm; similarly, main computing unit 3 can receive matrix sub-blocks from main computing unit 2 and main computing unit 4 in order to perform its second round of matrix multiplication according to the Cannon algorithm.
  • After completing all the matrix multiplication operations, each main computing unit obtains its corresponding result matrix sub-block: for example, main calculation unit 1 computes its result matrix sub-block, main calculation unit 2 computes its result matrix sub-block, and so on.
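  • The round structure described above can be summarized by the following sketch of a 2 × 2 Cannon schedule, in which the entries of A and B stand for matrix sub-blocks and "*" stands for a sub-block multiplication (the skewed indexing is the textbook Cannon formulation, given here as an illustration rather than the patent's exact sub-block labeling):

      def cannon_2x2(A, B):
          """2 x 2 block Cannon schedule: main unit (i, j) accumulates C[i][j]
          over two rounds, consuming one skewed (A, B) pair per round; round 2
          uses the sub-blocks received from the neighbouring units."""
          C = [[0, 0], [0, 0]]
          for i in range(2):
              for j in range(2):
                  for r in range(2):          # two computation rounds
                      t = (i + j + r) % 2     # Cannon's skewed index
                      C[i][j] += A[i][t] * B[t][j]
          return C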
  • FIG. 8 is a schematic diagram illustrating a calculation subunit performing a matrix multiplication operation according to an embodiment of the present disclosure.
  • each main computing unit in FIG. 7 may include a plurality of computing sub-units that perform matrix multiplication operations in parallel, for example, the four computing units shown in FIG. 8 Subunits, namely calculation subunit 1 to calculation subunit 4.
  • each calculation subunit can obtain the atomic matrix required for the matrix multiplication operation from the L1 cache area, that is, the minimum matrix unit of the matrix multiplication operation supported by the calculation subunit.
  • Corresponding to FIG. 7, in order for each main computing unit to obtain its corresponding result matrix sub-block, each computing subunit needs to complete two rounds of matrix multiplication, namely the first and second rounds of computation. As shown in FIG. 8, in the first round of computation, computation subunit 1 fetches a pair of atomic matrices from the L1 cache area and performs their matrix multiplication. Next, in the second round of computation, computation subunit 1 fetches another pair of atomic matrices from the L1 cache area and performs their matrix multiplication. Finally, the intermediate results of the two rounds of computation are added up to obtain the intermediate result of the atomic-matrix multiplication performed by computation subunit 1. A similar situation applies to computation subunit 2 through computation subunit 4, which will not be repeated here.
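  • The two-round accumulation performed by each computing subunit reduces to a multiply-accumulate over the fetched atomic-matrix pairs, as in this plain-Python sketch (added for illustration):

      def subunit_accumulate(pairs):
          """Sum the products of the (A_atom, B_atom) pairs fetched from the
          L1 cache area, one pair per computation round; atomic matrices are
          given as lists of lists of numbers."""
          C = None
          for A_atom, B_atom in pairs:
              rows, inner, cols = len(A_atom), len(B_atom), len(B_atom[0])
              if C is None:
                  C = [[0] * cols for _ in range(rows)]
              for i in range(rows):           # C += A_atom x B_atom
                  for j in range(cols):
                      C[i][j] += sum(A_atom[i][t] * B_atom[t][j]
                                     for t in range(inner))
          return C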
  • the optimal algorithm solution refers to selecting an optimal algorithm from a plurality of matrix multiplication algorithms suitable for the matrix multiplication operation to perform the matrix multiplication operation.
  • different search spaces can be set by performing different splitting methods in the system-on-chip, thereby finally obtaining different matrix multiplication algorithms.
  • For example, when the splitting operation is performed at the computing-subunit level, it may be chosen to split only the matrix sub-blocks of the first matrix (e.g., splitting in the direction of the M rows) and not to split the matrix sub-blocks of the second matrix, so as to obtain a corresponding search space and finally form a new matrix multiplication algorithm.
  • Alternatively, the matrix sub-blocks of the second matrix can be split (e.g., splitting in the direction of the N columns) without splitting the matrix sub-blocks of the first matrix, thereby forming another new matrix multiplication algorithm. It can be seen that neither of the two matrix multiplication algorithms obtained above performs computing-subunit-level splitting in the K direction (the column direction for the first matrix and the row direction for the second matrix).
  • Here, M, K, N, M_b, K_b and N_b have the same meanings as the corresponding items in the foregoing expressions (e.g., formula (2)).
  • FIG. 9 is a flowchart illustrating a method 900 for selecting an optimal matrix multiplication algorithm according to an embodiment of the present disclosure.
  • First, at step S902, a cost function may be determined; the manner of determining the cost function may be the manner described above in conjunction with FIG. 2 (for example, any of the forms of formulas (1) to (5)), which will not be repeated here.
  • Next, at step S904, the search space of each of the plurality of matrix multiplication algorithms (i.e., the plurality of "candidate algorithms" above) is determined, and at step S906 the search step size of each search space is determined.
  • the manner of determining the search space and the search step size may refer to the foregoing description, which will not be repeated here.
  • Then, at step S908, a search algorithm (such as the aforementioned global search, neighborhood search or genetic algorithm) is used to search with the determined search step size, so that at step S910 the splitting coefficients corresponding to each matrix multiplication algorithm are determined (e.g., the splitting coefficients M_bi, K_bi and N_bi for the i-th algorithm).
  • After the splitting coefficients of each algorithm are obtained, the cost function value of each matrix multiplication algorithm may be calculated at step S912, and the matrix multiplication algorithm with the smallest cost function value is determined at step S914. Therefore, at step S916, the matrix multiplication algorithm with the smallest cost function value is selected as the optimal matrix multiplication algorithm, and its corresponding splitting coefficients are used to implement the multi-level splitting operation on the large matrices.
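  • Putting steps S902 to S916 together, the selection loop can be sketched as follows, assuming each candidate algorithm exposes a hypothetical search() method that runs steps S904 to S910 for its own search space and returns its best splitting coefficients together with the resulting cost value:

      def select_algorithm(candidates):
          """Method 900 sketch: evaluate every candidate matrix multiplication
          algorithm and keep the one with the smallest cost function value."""
          best_algo, best_split, best_cost = None, None, float("inf")
          for algo in candidates:
              split, cost_value = algo.search()   # steps S904-S910 per candidate
              if cost_value < best_cost:          # steps S912 and S914
                  best_algo, best_split, best_cost = algo, split, cost_value
          return best_algo, best_split            # step S916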
  • an optimal algorithm can be selected from a plurality of algorithms for matrix multiplication operations.
  • the selected algorithm can realize the multiplication operation of large matrices with the minimum operation cost, thereby improving the operation efficiency of the matrix multiplication operation and reducing the calculation cost.
  • the resource usage of the system-on-chip is also maximized, thereby giving full play to the computing capability of the system-on-chip.
  • FIG. 10 is a structural diagram illustrating a combined processing apparatus 1000 according to an embodiment of the present disclosure.
  • the combined processing device 1000 includes a computing processing device 1002 , an interface device 1004 , other processing devices 1006 and a storage device 1008 .
  • one or more integrated circuit devices 1010 may be included in the computing processing device, and the integrated circuit device may include the system-on-chip described in the context of the present disclosure for performing matrix multiplication operations between matrices.
  • the matrix may be a large matrix or a very large matrix.
  • In the context of the present disclosure, the aforementioned large matrix or super-large matrix (such as the aforementioned first matrix and second matrix of the present disclosure) can be split via the splitting coefficients, so as to obtain matrix blocks suitable for the system-on-chip to perform matrix multiplication operations.
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations, such as the matrix multiply operations of the present disclosure.
  • the computing processing device may be implemented as (or include) a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • When multiple computing devices are implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core, the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • These processors may include, but are not limited to, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • In one or more embodiments, the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a related computing device for artificial intelligence such as neural network operations) and external data and control, performing basic controls including, but not limited to, data movement and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 1102 shown in FIG. 11 ).
  • In one embodiment, the chip is a System on Chip (SoC) that integrates one or more combined processing devices as shown in FIG. 10, and it can be configured to perform matrix multiplication operations between matrices.
  • the chip can be connected with other related components through an external interface device (such as the external interface device 1106 shown in FIG. 11 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
  • The chip may further integrate other processing units (such as video codecs) and interface modules (such as DRAM interfaces).
  • the present disclosure also discloses a chip package structure including the above-mentioned chip. In some embodiments, the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 11 .
  • FIG. 11 is a schematic structural diagram illustrating a board 1100 according to an embodiment of the present disclosure.
  • the board includes a storage device 1104 for storing data, which includes one or more storage units 1110 .
  • the storage device can be connected and data transferred with the control device 1108 and the chip 1102 described above through, for example, a bus.
  • the board also includes an external interface device 1106, which is configured for data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 1112 (such as a server or a computer).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • the control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • The control device may include a microcontroller (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • the present disclosure also discloses an electronic device or device, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or a plurality of the above-mentioned combined processing devices.
  • the electronic device or apparatus may be configured to perform the matrix multiplication operation discussed in the context of the present disclosure, and the matrix data participating in the matrix multiplication operation is obtained after being split by the optimal splitting coefficient of the present disclosure matrix block.
  • The electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields.
  • the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras).
  • In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, thereby completing unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions . Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure, and can also refer to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • a computer device eg, a personal computer, a server or network equipment, etc.
  • The aforementioned memory may include, but is not limited to, a U disk, a flash disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, an optical disc, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • The aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a resistive random access memory (Resistive Random Access Memory, RRAM), a dynamic random access memory (Dynamic Random Access Memory, DRAM), a static random access memory (Static Random Access Memory, SRAM), an enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), a high bandwidth memory (High Bandwidth Memory, HBM), a hybrid memory cube (Hybrid Memory Cube, HMC), a ROM, a RAM, etc.
  • any module, unit, component, server, computer, terminal, or device that executes instructions of the disclosed examples may include or otherwise have access to computer-readable media, such as storage media, computer storage media, or data storage devices (which may be removable and/or non-removable) such as magnetic disks, optical disks or tapes.
  • Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules or other data.
  • phrases "if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined” or “in response to the determination” or “once the [described condition or event] is detected. ]” or “in response to detection of the [described condition or event]”.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

A method for optimizing matrix multiplication operations of a system-on-chip, and related products. The method includes: receiving matrix information of a first matrix and a second matrix to be split in order to perform a matrix multiplication operation, the first matrix having M rows × K columns and the second matrix having K rows × N columns; and determining, by minimizing a cost function, split coefficients for splitting the first matrix and the second matrix, the split coefficients including the numbers of rows and columns of the matrix blocks obtained by splitting the first matrix and the numbers of rows and columns of the matrix blocks obtained by splitting the second matrix, where the cost function is used to determine the cost that transferring matrix data between the system-on-chip and an off-chip system imposes on performing the matrix multiplication operation on the system-on-chip.

Description

Method for optimizing matrix multiplication operations of a system-on-chip, and related products
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application No. 2021104141333, filed on April 16, 2021 and entitled "Method for Optimizing Matrix Multiplication Operations of a System-on-Chip and Related Products".
TECHNICAL FIELD
The present disclosure relates generally to the field of data computing. More specifically, the present disclosure relates to a method, a device and a computer-readable storage medium for optimizing matrix multiplication operations of a system-on-chip.
BACKGROUND
Matrix multiplication is a very common data operation in scientific computing and data processing. Taking the rapidly developing field of artificial intelligence as an example, it typically involves massive amounts of data computation, including matrix multiplication operations on various types of data. In deep learning, a research hotspot of artificial intelligence, many computing tasks involve large-scale matrix multiplications, in particular the multiplication of two large matrices; examples include deep neural networks (Deep Neural Networks, "DNN"), recurrent neural networks (Recurrent Neural Network, "RNN"), and the transformer networks widely applied in the field of natural language processing (Natural Language Processing, "NLP"). As is well known, the larger the data volume and data scale of the matrix multiplications involved, the higher the requirements on the computing power and memory-access performance of the computing platform, especially of the system-on-chip.
In existing matrix multiplication, processors such as central processing units ("CPUs") or graphics processing units ("GPUs") are typically used to perform the computation. However, because the capacity of the processor's internal memory resources is limited, the enormous amount of data involved in large-scale matrix multiplication causes frequent, massive data exchanges between the processor's on-chip system and external storage devices. Since the bandwidth of the input/output ("I/O") bus between the processor and the external memory is limited, this leads to severe I/O bottlenecks, and the resulting data-transfer latency greatly reduces the efficiency of parallel computation. Further, not only does the I/O bus bandwidth limit become a bottleneck of system performance, but the large volume of I/O memory accesses between the processor and the external storage devices also has a very adverse impact on computation and power overhead. Therefore, optimizing matrix memory access has become a very important means of improving the performance of general matrix multiplication.
SUMMARY
To solve at least the technical problems mentioned above, the present disclosure provides a solution capable of optimizing matrix multiplication operations of a system-on-chip. Specifically, the present disclosure proposes an optimal way of determining how the matrices in a matrix multiplication operation are split. By splitting the matrices in this optimal way, the matrix multiplication operation of the present disclosure significantly reduces the amount of data transferred to and from external storage devices, thereby minimizing the I/O bottleneck caused by the limited bus bandwidth and improving the efficiency of matrix multiplication. In view of this, the present disclosure provides the aforementioned solution in the following aspects.
In a first aspect, the present disclosure discloses a method for optimizing a matrix multiplication operation of a system-on-chip, the method being implemented by one or more processors and including: receiving matrix information of a first matrix and a second matrix to be split in order to perform the matrix multiplication operation, where the first matrix has M rows × K columns and the second matrix has K rows × N columns; and determining, by minimizing a cost function, split coefficients for splitting the first matrix and the second matrix, the split coefficients including the number of rows M_b and the number of columns K_b of the matrix blocks obtained by splitting the first matrix, and the number of rows K_b and the number of columns N_b of the matrix blocks obtained by splitting the second matrix, where the cost function is used to determine the cost that transferring matrix data between the system-on-chip and an off-chip system imposes on performing the matrix multiplication operation on the system-on-chip, and where the cost function is based at least on the data size of the first matrix, the data size of the second matrix, the number of rows M of the first matrix, the number of columns N of the second matrix, and the split coefficients.
In a second aspect, the present disclosure discloses a device for optimizing a matrix multiplication operation of a system-on-chip, including: a processor; and a memory storing program instructions for optimizing the matrix multiplication operation of the system-on-chip which, when executed by the processor, cause the device to perform the above method.
In a third aspect, the present disclosure discloses a computer-readable storage medium having stored thereon program instructions for optimizing a matrix multiplication operation of a system-on-chip which, when executed by a processor, implement the above method.
In a fourth aspect, the present disclosure discloses a system-on-chip for performing a matrix multiplication operation, including: multiple main processing units, where each main processing unit includes multiple processing subunits, and each processing subunit is used to perform a corresponding matrix multiplication operation; and multiple caches for caching matrix data on which the matrix multiplication operation is to be performed and results associated with the matrix multiplication operation, where the system-on-chip is configured to perform matrix multiplication between matrix blocks, and the matrix blocks are obtained by splitting matrices according to the split coefficients of the aforementioned method.
In a fifth aspect, the present disclosure discloses an integrated circuit apparatus including the aforementioned system-on-chip.
In a sixth aspect, the present disclosure discloses a board card including the aforementioned integrated circuit apparatus.
By using the above method, device and computer-readable storage medium of the present disclosure, the optimal way of splitting the matrices participating in the matrix multiplication can be determined, thereby significantly optimizing the matrix multiplication operation. Specifically, the scheme of the present disclosure constructs a cost function of the cost caused by transferring matrix block data between the on-chip and off-chip systems, and selects the optimal split coefficients for splitting the matrices with the goal of minimizing this cost function. Thus, through matrix multiplication performed on the basis of the optimal split coefficients, the scheme of the present disclosure can make full use of the on-chip resources of the on-chip system and reduce the I/O data exchange with the external memory of the off-chip system, thereby achieving efficient parallel execution of data transfer and matrix multiplication. Further, by performing multi-level splitting of large matrices in conjunction with the hardware architecture, the scheme of the present disclosure also reduces the complexity of the matrix multiplication operation and supports matrix multiplication of very large matrices. In some embodiments, through the above cost function, the scheme of the present disclosure can also select the optimal matrix multiplication algorithm from multiple candidate matrix multiplication algorithms, thereby achieving efficient execution of the matrix multiplication operation.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the drawings. In the drawings, several embodiments of the present disclosure are shown in an exemplary rather than restrictive manner, and identical or corresponding reference numerals denote identical or corresponding parts, in which:
Fig. 1 is a schematic diagram illustrating a matrix splitting operation according to an embodiment of the present disclosure;
Fig. 2 is a flowchart illustrating a method for optimizing a matrix multiplication operation of a system-on-chip according to an embodiment of the present disclosure;
Fig. 3 is an architecture diagram for performing matrix memory-access operations according to an embodiment of the present disclosure;
Fig. 4 is a schematic architecture diagram of the L2 cache shown in Fig. 3;
Fig. 5 is a schematic architecture diagram of the L1 cache shown in Fig. 3;
Figs. 6a and 6b are schematic diagrams illustrating the principle of matrix block splitting according to an embodiment of the present disclosure;
Fig. 7 is a structural block diagram of a system-on-chip performing matrix multiplication according to an embodiment of the present disclosure;
Fig. 8 is a schematic diagram illustrating compute subunits performing matrix multiplication according to an embodiment of the present disclosure;
Fig. 9 is a flowchart illustrating a method for selecting an optimal matrix multiplication algorithm according to an embodiment of the present disclosure;
Fig. 10 is a structural diagram of a combined processing apparatus according to an embodiment of the present disclosure; and
Fig. 11 is a schematic structural diagram of a board card according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below in conjunction with the drawings of the embodiments of the present disclosure. Obviously, the following description is intended to discuss multiple exemplary embodiments of the solution of the present disclosure and is not intended to be an exhaustive description of its embodiments. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the solution disclosed by the present disclosure.
The inventors of the present disclosure found that when two matrices are split in an arbitrary manner for matrix multiplication, the splitting does not significantly change the total amount of multiplications and additions in the matrix multiplication. However, the splitting does significantly change the amount of I/O between the on-chip system and the off-chip system. Therefore, optimizing the amount of I/O between the on-chip and off-chip systems is key to improving the performance of matrix multiplication. In view of this, in order to improve the data memory-access performance between the on-chip and off-chip systems in matrix multiplication, increase the computational efficiency of matrix multiplication and reduce the computational cost, the present disclosure proposes a scheme for optimizing matrix multiplication, which involves determining split coefficients for splitting matrices of relatively large size.
As for the matrix splitting operation, as known to those skilled in the art, when two matrices of relatively large size are multiplied, the large matrices can be split into blocks, each block (i.e., a "matrix block" in the context of the present disclosure) being regarded as one element of the matrix, and the matrix multiplication is carried out on the basis of these elements. Through such a splitting operation, the general matrix multiplication problem is converted into a block matrix multiplication problem. This makes the multiplication of large matrices clearer and more explicit, thereby greatly simplifying the computation. Further, considering that the storage and computing resources of the on-chip system of a computing device are very limited, block matrix multiplication is also an important means for the on-chip system to solve the general matrix multiplication problem. By splitting the large matrices in advance according to the on-chip resources of the on-chip system, the on-chip system can at each step perform only the multiplication of two matrix blocks obtained from the splitting, thereby adapting the matrix multiplication to the limited storage and computing resources. The aforementioned matrix splitting operation is described illustratively below taking Fig. 1 as an example.
Fig. 1 is a schematic diagram illustrating a splitting operation on matrices according to an embodiment of the present disclosure, where the upper part shows the matrices before splitting and the lower part, as indicated by the arrow, shows the matrices after splitting. Specifically, Fig. 1 shows matrix A (i.e., the "first matrix" of the present disclosure) and matrix B (i.e., the "second matrix" of the present disclosure) undergoing a matrix multiplication operation to obtain matrix C as the result matrix. To perform the matrix multiplication on the on-chip system, the matrices A and B shown in the upper part of Fig. 1 can be split, where one dashed box in matrix A or B represents an exemplary matrix block. This yields the A matrix block A_11 (composed of elements a_11, a_12, a_21, a_22, etc.) and the B matrix block B_11 (composed of elements b_11, b_12, b_21, b_22, etc.) shown in the lower part, where each matrix block serves as a new element of the split matrix. Through such a splitting operation, the general matrix multiplication in the upper part of Fig. 1 is converted into the block matrix multiplication shown in the lower part, i.e., C_{M′×N′} = A_{M′×K′} * B_{K′×N′}, where the block matrix multiplication can be expressed as:

C_ij = Σ_{p=1}^{K′} A_ip × B_pj    (1)

where 0 < i ≤ M′ and 0 < j ≤ N′.
For ease of description, in the context of the present disclosure, each matrix block obtained by splitting matrices A, B and C is denoted as: A_block (of size M_b rows * K_b columns), B_block (of size K_b rows * N_b columns) and C_block (of size M_b rows * N_b columns). On this basis, the present disclosure proposes a scheme for determining the optimal matrix blocks, i.e., for optimally splitting (or blocking) the matrices. With the scheme of the present disclosure, the aforementioned M_b, K_b and N_b can be determined, and the first and second matrices are split using the determined M_b, K_b and N_b (i.e., the split coefficients in the context of the present disclosure). The scheme of the present disclosure simplifies the matrix multiplication operation, minimizes the I/O bottleneck caused by the bandwidth limit of the on-chip system, and thus improves the efficiency of matrix multiplication.
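To make the splitting concrete, the following Python sketch (illustrative only; the function and variable names are hypothetical and not taken from the present disclosure) performs a general matrix multiplication block by block using split coefficients M_b, K_b and N_b:

def blocked_matmul(A, B, M, K, N, Mb, Kb, Nb):
    """Multiply A (M x K) by B (K x N) block by block.

    A and B are nested Python lists; Mb, Kb, Nb are the split
    coefficients, i.e. the block sizes along the M, K and N dimensions.
    """
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, Mb):            # block rows of C
        for j0 in range(0, N, Nb):        # block columns of C
            for p0 in range(0, K, Kb):    # accumulate along the K dimension
                # multiply one (Mb x Kb) block of A with one (Kb x Nb) block of B
                for i in range(i0, min(i0 + Mb, M)):
                    for p in range(p0, min(p0 + Kb, K)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + Nb, N)):
                            C[i][j] += a * B[p][j]
    return C

For example, with M = K = N = 4 and Mb = Kb = Nb = 2, each of the four result blocks is accumulated from two block products, exactly as in the block formula (1) above.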
Fig. 2 is a flowchart illustrating a method 200 for optimizing a matrix multiplication operation of a system-on-chip according to an embodiment of the present disclosure. As known to those skilled in the art, a system-on-chip typically integrates a complete system on a single chip. Such a system generally includes various modules such as a system-level chip control logic module, a microprocessor/microcontroller CPU core module, embedded memory modules, and interface modules for communicating with an off-chip system. In the context of the present disclosure, the system-on-chip described here can be a system-level chip supporting matrix multiplication, which includes multiple main compute units for performing matrix multiplication and memory for storing matrix data and matrix multiplication results. According to the context of the present disclosure, the aforementioned multiple main compute units can be connected in sequence to form a data-transfer loop, and each main compute unit can include multiple compute subunits, so that matrix splitting at the main-compute-unit level and secondary matrix splitting at the compute-subunit level, i.e., multi-level matrix splitting, can be achieved. On this basis, it can be understood that the split coefficients of the present disclosure are also related to the number of main compute units of the system-on-chip and the number of compute subunits contained in each main unit. Exemplary connections and arrangements of the main compute units and compute subunits will be described in detail later in conjunction with the drawings.
As shown in Fig. 2, the method 200 of the present disclosure for optimizing a matrix multiplication operation of a system-on-chip includes, at step S202, receiving matrix information of a first matrix (e.g., matrix A in Fig. 1) and a second matrix (e.g., matrix B in Fig. 1) to be split in order to perform the matrix multiplication operation. Depending on the implementation, the aforementioned matrix information can include the data size and data information of the matrices. For example, the matrix information can indicate that the first matrix is a large matrix of M rows * K columns and the second matrix is a large matrix of K rows * N columns. Further, the matrix information can also include the data size (e.g., in bits or bytes) of each element in the first matrix and/or the second matrix. It can be understood that, since the splitting operation is required, M, K and N can represent relatively large positive integers, such as 256, 512, 1024 or 2048, so that the scheme of the present disclosure can be applied to the splitting and multiplication of large matrices.
At step S204, the split coefficients for splitting the first matrix and the second matrix are determined by minimizing a cost function. In one embodiment, the aforementioned split coefficients can include the size of the split matrix blocks ("block"), for example the number of rows M_b and the number of columns K_b of the matrix blocks obtained by splitting the first matrix, and the number of rows K_b and the number of columns N_b of the matrix blocks obtained by splitting the second matrix.
According to an embodiment of the present disclosure, the above cost function is used to determine (or measure) the cost that transferring matrix data between the on-chip system and the off-chip system (as shown in Fig. 3) imposes on performing the matrix multiplication operation on the on-chip system. In one implementation scenario, the on-chip system can be provided with on-chip caches 304, 306 for storing matrix blocks. Correspondingly, the off-chip system can be provided with a global memory 302, which can transfer various types of data, including matrix blocks, with the on-chip caches through an I/O interface. In one scenario, the global memory here can be a dynamic random access memory ("DRAM"), such as a double data rate synchronous dynamic random access memory ("DDR").
Based on the purpose of the above cost function, the present disclosure proposes constructing the expression of the cost function of the present disclosure based at least on the data size of the first matrix, the data size of the second matrix, the number of rows M of the first matrix, the number of columns N of the second matrix, and the split coefficients (i.e., the M_b, K_b and N_b that the scheme of the present disclosure is to determine).
As an example, the cost function cost of the present disclosure can be expressed in the following form:

cost = ⌈N/N_b⌉ × A_size + ⌈M/M_b⌉ × B_size    (2)

where A_size and B_size denote the total data sizes of matrix A (i.e., the first matrix of the present disclosure) and matrix B (i.e., the "second matrix" of the present disclosure), respectively, and ⌈·⌉ denotes the round-up (ceiling) operation. When dw(A) and dw(B) denote the size (in bits or bytes) of each data element in matrices A and B, respectively, A_size and B_size in expression (2) can be written as A_size = M × K × dw(A) and B_size = N × K × dw(B).
Based on the above equivalent substitution of A_size and B_size, the cost function of expression (2) can be further expressed as:

cost = ⌈N/N_b⌉ × M × K × dw(A) + ⌈M/M_b⌉ × N × K × dw(B)    (3)

Further, the cost function can also be expressed in the following form (i.e., omitting the factor K in expression (3)):

cost = ⌈N/N_b⌉ × M × dw(A) + ⌈M/M_b⌉ × N × dw(B)    (4)
In one embodiment, considering the impact that loading matrix blocks from a large matrix has on the overall time consumption, the scheme of the present disclosure further proposes incorporating a bandwidth utilization coefficient into the cost function, where the bandwidth utilization coefficient equals the ratio between the effective bandwidth when loading a matrix block from the off-chip system by a predetermined data length and the total bandwidth between the on-chip and off-chip systems. For example, a bandwidth utilization coefficient γ(L) can be added to the cost functions of expressions (1)-(4), equal to the ratio between the effective bandwidth of loading a matrix block segment by segment with data length "L" (e.g., every L elements of the matrix block) and the full bandwidth ("full bandwidth"). Here, the effective bandwidth is the reciprocal of the time taken to load one matrix block segment by segment at a given data length as above, while the full bandwidth refers to the total bandwidth of data transfer between the on-chip and off-chip systems, which is approximately equal to the reciprocal of the time taken to load a matrix block from the off-chip system to the on-chip system continuously in one pass.
By introducing the above bandwidth utilization coefficient, the cost function of expression (1) can, for example, be further expressed as:

cost = ⌈N/N_b⌉ × A_size / γ(lda_b) + ⌈M/M_b⌉ × B_size / γ(ldb_b)    (5)

In expression (5), lda and ldb denote the leading dimensions ("leading dimension", abbreviated "ld") of matrices A and B, respectively, where the leading dimension refers to the row width or column width of a matrix when it is stored in the off-chip system in either row-major or column-major format; lda_b denotes the leading dimension of the matrix blocks obtained by splitting matrix A, and ldb_b denotes the leading dimension of the matrix blocks obtained by splitting matrix B. For example, when a matrix is stored row by row in the off-chip system, the matrix is row-major and its leading dimension is its row width (i.e., the number of column elements). Similarly, when a matrix is stored column by column in the off-chip system, the matrix is column-major and its leading dimension is its column width (i.e., the number of row elements).
Further, lda_b and ldb_b denote the splitting granularity of matrices A and B along the leading dimension, i.e., the number of elements of a matrix block along the leading dimension. For example, when the matrices are stored row-major in the off-chip system and neither A nor B is transposed, lda_b = K_b and ldb_b = N_b; when they are stored row-major, A is not transposed and B is transposed, lda_b = K_b and ldb_b = K_b. On this basis, similar to the description of the bandwidth utilization coefficient γ(L) above, "γ(ldb_b)" in expression (5) denotes the ratio between the effective bandwidth of ldb_b and the full bandwidth ("full bandwidth"), where the effective bandwidth of ldb_b refers to the reciprocal of the time taken to load one matrix block segment by segment with the data length of ldb_b (e.g., N_b). Likewise, "γ(lda_b)" in expression (5) denotes the ratio between the effective bandwidth of lda_b and the full bandwidth ("full bandwidth"), where the effective bandwidth of lda_b refers to the reciprocal of the time taken to load one matrix block segment by segment with the data length of lda_b (e.g., K_b).
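As a sketch of how such a cost function could be evaluated, the following Python fragment assumes the form of expression (5) as reconstructed above; the function names and the example bandwidth profile are hypothetical, not part of the present disclosure:

import math

def cost(M, N, A_size, B_size, Mb, Nb, lda_b, ldb_b, gamma):
    """Evaluate the I/O cost of one choice of split coefficients, cf. expression (5).

    gamma(L) is the ratio of the effective bandwidth, when loading data
    segment by segment with length L, to the full bandwidth.
    """
    loads_of_A = math.ceil(N / Nb)  # matrix A is reloaded once per block column of C
    loads_of_B = math.ceil(M / Mb)  # matrix B is reloaded once per block row of C
    return loads_of_A * A_size / gamma(lda_b) + loads_of_B * B_size / gamma(ldb_b)

# hypothetical bandwidth profile: longer contiguous segments use bandwidth better
example_gamma = lambda L: min(1.0, L / 1024)

A larger N_b lowers the number of times matrix A must be reloaded, while a larger granularity along the leading dimension raises γ; the search described below explores exactly this trade-off.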
Although not shown in Fig. 2, in one embodiment, in determining the split coefficients by minimizing the cost function, the method 200 can further include establishing a search space for minimizing the cost function, so as to determine the split coefficients using the search space. In one embodiment, establishing the search space for minimizing the cost function can include partitioning the cache (or cache memory) of the on-chip system, and establishing the search space according to the partitioning result. Here, the aforementioned cache is arranged to store the split matrix blocks and the matrix multiplication results obtained by performing the matrix multiplication operation on the split matrix blocks. To facilitate understanding of the storage partitioning of the on-chip system of the present disclosure, a description is first given in conjunction with Fig. 3.
Fig. 3 is an architecture diagram for performing matrix memory-access operations according to an embodiment of the present disclosure. As shown in Fig. 3, the architecture of the present disclosure includes an on-chip system and an off-chip system. For simplicity, only the global memory DRAM 302 of the off-chip system is shown in the figure as an example. During the loading of matrix blocks, the DRAM can transfer data with the L2 cache 304 ("cache") through a DDR interface, for example splitting the matrix blocks on which the matrix multiplication is to be performed into matrix sub-blocks and loading them into the L2 cache 304. In one embodiment, the L2 cache 304 can be a shared memory ("Shared Memory") of the on-chip system, which can be shared by multiple main compute units.
Further, the L2 cache 304 can transfer data with multiple L1 caches 306, so that the atomic matrices obtained by splitting the matrix sub-blocks again are correspondingly transferred to the L1 caches 306. In the context of the present disclosure, an atomic matrix can be regarded as the smallest matrix unit of the matrix multiplication supported by a compute subunit. Thereafter, a compute core ("Core") 308 (i.e., the aforementioned compute subunit) can obtain atomic matrices from the L1 cache 306 in order to perform matrix multiplication between atomic matrices. In this scenario, the L1 cache 306 can be regarded as the private storage of each compute core 308. According to the scheme of the present disclosure, multiple compute subunits can form one main compute unit; for example, the four compute cores 308 in Fig. 3 can constitute one main compute unit of the present disclosure.
Based on the above description, those skilled in the art can understand that the on-chip system of the present disclosure can include multiple levels of cache. Thus, for example, the L2 cache 304 shown in Fig. 3 can be regarded as the level-one cache, and the L1 cache 306 can be regarded as the level-two cache. On this basis, the method of the present disclosure can include establishing a search subspace associated with each level of cache according to a predetermined matrix multiplication algorithm used to perform the matrix multiplication operation. In one implementation scenario, the corresponding first and second search subspaces can be established according to the level-one cache (e.g., the L2 cache 304) and the level-two cache (e.g., the L1 cache 306).
In view of the above scenario, the method 200 of the present disclosure can further include: establishing the first search subspace according to the arrangement of multiple first buffers in the level-one cache, where the multiple first buffers are used to store the matrix sub-blocks obtained by splitting the matrix blocks and the intermediate results obtained by performing matrix multiplication on the matrix sub-blocks; and establishing the second search subspace according to the arrangement of multiple second buffers in the level-two cache, where the multiple second buffers are used to store the atomic matrices obtained by splitting the matrix sub-blocks and the intermediate results obtained by performing matrix multiplication on the atomic matrices.
Figs. 4 and 5 are now taken as examples to discuss how the search space of the present disclosure is constructed on the basis of two levels of cache. As shown in Figs. 4 and 5, the two levels of cache are the L2 cache and the L1 cache, respectively (i.e., the L2 cache and L1 cache shown in Fig. 3), and it is assumed that the Cannon ("cannon") algorithm is used to accelerate the matrix multiplication.
First, three independent buffers ("buffer") can be set up in the L2 cache 304 for matrix A and matrix B respectively, i.e., buffer1, buffer2 and buffer3 shown in Fig. 4. In terms of purpose, buffer1 can be used to receive data sent from other main compute units, buffer2 can load matrix data from the global memory (e.g., the DRAM shown in Fig. 3), and buffer3 is provided to the main compute unit for passing data to the L1 cache, so that the compute subunits can perform real-time computation and intermediate results are kept at the L1 cache. Based on the foregoing arrangement, and considering that the matrix blocks (e.g., the aforementioned A_block, B_block and C_block) are each split into P_1 parts along the M, K and N dimensions (i.e., along the row and column directions of the two matrices) according to the first-level Cannon algorithm, each matrix block thereby being split into P_1 × P_1 matrix sub-blocks, the constraint on the L2 cache can be expressed by the following expression (6):

3 × (M_b/P_1) × (K_b/P_1) × dw(A) + 3 × (K_b/P_1) × (N_b/P_1) × dw(B) ≤ Space_L2    (6)

where Space_L2 denotes the storage capacity of the L2 cache 304. Expression (6) is the aforementioned first search subspace, and the present disclosure searches for suitable M_b, K_b and N_b under the condition that the inequality of expression (6) is satisfied. In addition, it should be noted that the above "P_1" is also related to the configuration of the main compute units of the on-chip system. For example, when the on-chip system includes 4 main compute units, "P_1" takes the value 2, i.e., each matrix block is split into 2 parts along each of the M, K and N dimensions, thereby splitting one matrix block into four matrix sub-blocks. Similarly, when the on-chip system includes 9 main compute units, "P_1" takes the value 3, i.e., each matrix block is split into 3 parts along each of the M, K and N dimensions, thereby splitting one matrix block into nine matrix sub-blocks.
It can be understood that the above first search subspace is determined after taking into account the Cannon algorithm used for the matrix multiplication. More generally, for an arbitrary matrix multiplication algorithm, the present disclosure proposes that S_10, S_11 and S_13 independent buffers can be set up in the L2 cache for the A_block of the first matrix, the B_block of the second matrix, and the C_block of the result matrix, respectively, and the matrix blocks are split into P_10, P_11 and P_12 parts along the M, K and N dimensions, respectively, according to the matrix multiplication algorithm; the constraint on the L2 cache (i.e., the first search subspace of the present disclosure) can then be expressed by the following expression (7):

S_10 × (M_b/P_10) × (K_b/P_11) × dw(A) + S_11 × (K_b/P_11) × (N_b/P_12) × dw(B) + S_13 × (M_b/P_10) × (N_b/P_12) × dw(C) ≤ Space_L2    (7)

It can be seen that when, according to expression (7), 0 buffers are set up for C_block (i.e., S_13 = 0, that is, no buffer for storing the result matrix is set up in the L2 cache), three buffers each are set up for A_block and B_block (i.e., S_10 = S_11 = 3), and the matrix blocks are split into P_1, P_1 and P_1 parts along the M, K and N dimensions, respectively (i.e., P_10 = P_11 = P_12 = P_1), the first search subspace expressed by expression (7) reduces to the first search subspace expressed by expression (6). Therefore, expression (6) can be regarded as a special case of expression (7).
After determining the first search subspace as above, the present disclosure sets up multiple buffers in the L1 cache according to the Cannon algorithm in order to determine the second search subspace of the present disclosure. To this end, the scheme of the present disclosure proposes that two independent buffers can be set up in the L1 cache for matrix A and matrix B respectively, i.e., buffer1 and buffer2 shown in Fig. 5, for the pipelined execution of the matrix multiplication of atomic matrices (obtained by splitting the matrix sub-blocks once more). For the pipelined execution, buffer1 and buffer2 alternate between receiving atomic matrices and participating in computation. Next, one independent buffer is set up for the matrix C resident in the L1 cache, i.e., buffer3 shown in Fig. 5, for storing the intermediate results obtained by performing matrix multiplication between atomic matrices. Similar to the determination of the first search subspace, on the basis of the first-level Cannon algorithm (e.g., the aforementioned splitting of the matrix blocks A_block, B_block and C_block into P_1 parts along the M, K and N dimensions to obtain the matrix sub-blocks of the present disclosure), each of the split matrix sub-blocks is further split into P_0 parts along the M, K and N dimensions according to the second-level Cannon algorithm to obtain the atomic matrices of the present disclosure. On this basis, the constraint on the L1 cache can be expressed by the following expression (8):

2 × (M_b/(P_1×P_0)) × (K_b/(P_1×P_0)) × dw(A) + 2 × (K_b/(P_1×P_0)) × (N_b/(P_1×P_0)) × dw(B) + (M_b/(P_1×P_0)) × (N_b/(P_1×P_0)) × dw(C) ≤ Space_L1    (8)

where Space_L1 denotes the storage capacity of the L1 cache. Expression (8) is the aforementioned second search subspace, and the present disclosure searches for suitable M_b, K_b and N_b under the condition that the inequality of expression (8) is satisfied. In addition, it should be noted that, similar to the aforementioned "P_1", "P_0" is also related to the configuration of the compute subunits of the on-chip system. For example, when each main compute unit of the on-chip system includes 4 compute subunits, "P_0" takes the value 2, i.e., each matrix sub-block is split into 2 parts along each of the M, K and N dimensions, thereby splitting one matrix sub-block into four atomic matrices. Similarly, when each main compute unit includes 9 compute subunits, "P_0" takes the value 3, i.e., each matrix sub-block is split into 3 parts along each of the M, K and N dimensions, thereby splitting one matrix sub-block into nine atomic matrices.
It can be understood that the above second search subspace is determined after taking into account the Cannon algorithm used for the matrix multiplication. More generally, for an arbitrary matrix multiplication algorithm, the present disclosure proposes that S_00, S_01 and S_03 independent buffers can be set up in the L1 cache for the A_block of the first matrix, the B_block of the second matrix, and the C_block of the result matrix, respectively, and, on top of the L2 cache arrangement, the aforementioned matrix sub-blocks are further split into P_00, P_01 and P_02 parts along the M, K and N dimensions, respectively, according to the algorithm; the constraint on the L1 cache (i.e., the second search subspace of the present disclosure) can then be expressed by the following expression (9):

S_00 × (M_b/(P_10×P_00)) × (K_b/(P_11×P_01)) × dw(A) + S_01 × (K_b/(P_11×P_01)) × (N_b/(P_12×P_02)) × dw(B) + S_03 × (M_b/(P_10×P_00)) × (N_b/(P_12×P_02)) × dw(C) ≤ Space_L1    (9)

Just as expression (6) can be regarded as a special case of expression (7), expression (8) can also be regarded as a special case of expression (9). For example, when P_10 = P_11 = P_12 = P_1, P_00 = P_01 = P_02 = P_0, S_00 = S_01 = 2 and S_03 = 1, the second search subspace expressed by expression (9) reduces to the second search subspace expressed by expression (8). Therefore, expression (8) can be regarded as a special case of expression (9).
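For illustration, the two constraint sets can be written as simple feasibility predicates. The sketch below encodes expressions (6) and (8) in the reconstructed forms given above; all names are hypothetical:

def fits_l2(Mb, Kb, Nb, P1, dwA, dwB, space_l2):
    """L2 constraint per expression (6): three buffers each for the A and B sub-blocks."""
    a_sub = (Mb / P1) * (Kb / P1) * dwA
    b_sub = (Kb / P1) * (Nb / P1) * dwB
    return 3 * a_sub + 3 * b_sub <= space_l2

def fits_l1(Mb, Kb, Nb, P1, P0, dwA, dwB, dwC, space_l1):
    """L1 constraint per expression (8): double buffers for the A and B atoms, one C buffer."""
    a_atom = (Mb / (P1 * P0)) * (Kb / (P1 * P0)) * dwA
    b_atom = (Kb / (P1 * P0)) * (Nb / (P1 * P0)) * dwB
    c_atom = (Mb / (P1 * P0)) * (Nb / (P1 * P0)) * dwC
    return 2 * a_atom + 2 * b_atom + c_atom <= space_l1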
The search space of the present disclosure has been described in detail above in conjunction with Figs. 4 and 5. After the aforementioned search space is determined, determining the split coefficients by minimizing the cost function can include determining the search steps used to search the search space, where the search steps include the search steps Δm, Δk and Δn associated with the M, K and N dimensions, respectively. Further, a search algorithm can be used to search the search space with the search steps to determine the M_b, K_b and N_b that minimize the cost function.
In one embodiment, the above search steps of the present disclosure can be determined by considering the following two factors:
(i) The tiling size ("tiling size") of the matrices to be multiplied is determined according to the computing capability of each compute subunit among the multiple compute subunits of the on-chip system (e.g., a "core" in Fig. 3 or a compute subunit in Fig. 8). In the scheme of the present disclosure, the aforementioned computing capability relates to the matrix size that a compute subunit can support in a matrix multiplication operation, i.e., the size of the atomic matrix of the present disclosure. In the scenario of the two-level Cannon algorithm above, the tiling size here is the size that the first or second matrix, after the second-level splitting, must satisfy for the compute subunit's matrix multiplication, and the atomic matrices obtained after the two-level splitting can be stored in the aforementioned L1 cache. The tiling size supported by a compute subunit can differ depending on the hardware architecture and the matrix multiplication algorithm. For example, depending on the scenario, the tiling size can be (8×8), (16×16) or ((16*V)×(16*V)), where "*" denotes multiplication and V is a positive integer greater than 1. In one scenario, assuming the tiling size supported by the compute subunit is (16×16), then when the storage space of the L1 cache is large, a matrix of size ((16*V)×(16*Q)) can also be stored in it, where Q is a positive integer greater than 1. When performing the matrix multiplication, a (16×16) matrix can be read from the aforementioned L1 cache each time for the multiplication.
In view of the above description, the K and N dimensions of matrices A and B can be aligned to some relatively small positive integers e_K and e_N, respectively. In other words, the matrix size computable on a compute subunit is an integer multiple of e_K in the K dimension and of e_N in the N dimension. In different application scenarios, e_K and e_N can, according to the computing capability of the compute subunit, take values such as 4, 8, 16, 32 or 64. Assuming the M dimension has no such alignment restriction, and considering that the matrix blocks are split among and within the main compute units according to the Cannon algorithm, the search steps Δm, Δk and Δn in the M, K and N row and column dimensions satisfy the following equations, where P_0 and P_1 have the same meanings as in expression (8).
Δn = n′ × P_1 × P_0 × e_N    (10)
Δk = k′ × P_1 × P_0 × e_K    (11)
Δm = m′ × P_1 × P_0    (12)

More generally, the matrix size computable on a compute subunit is an integer multiple of e_M, e_K and e_N in the M, K and N dimensions, respectively. When considering the splitting and storage of the matrix blocks across the two levels of storage, the L1 cache and the L2 cache, the search steps Δm, Δk and Δn in the M, K and N row and column dimensions satisfy the following equations, where P_12, P_11, P_10, P_00, P_01 and P_02 have the same meanings as in expression (9):

Δn = n′ × P_12 × P_02 × e_N    (13)
Δk = k′ × P_11 × P_01 × e_K    (14)
Δm = m′ × P_10 × P_00 × e_M    (15)

where n′, k′ and m′ in expressions (10)-(15) are arbitrary positive integers.
(ii) The leading dimensions lda and ldb of the matrices are determined according to the storage format (e.g., column-major or row-major) and the transposition of matrices A and B, and the search steps of matrices A and B along the leading dimension are determined according to the performance of uploading and downloading data on the DRAM or DDR (i.e., the off-chip system or global memory of the present disclosure).
In one scenario, assuming matrices A and B are both stored row-major in the DDR and neither is transposed, the leading dimension of A is K, i.e., lda = K, and the leading dimension of B is N, i.e., ldb = N; in the row-major case, if B needs to be transposed, the leading dimension of B is K, i.e., ldb = K. If matrices A and B are stored column-major in the DDR, the situation is exactly the opposite of the row-major case above. For example, when neither A nor B needs to be transposed, the leading dimension of A is M, i.e., lda = M, and the leading dimension of B is K, i.e., ldb = K.
Considering that the L2 cache or the L1 cache loads data from the DDR each time, and assuming the highest bandwidth utilization is achieved when the data size of one load is "L" (i.e., the segment-by-segment loading described above), the splitting granularity of matrices A and B along the leading dimension can take "L" as the unit. For example, in the row-major case, when neither A nor B needs to be transposed, the leading dimension of A is K and therefore Δk = k″ × L, while the leading dimension of B is N and therefore Δn = n″ × L, where k″ and n″ can be arbitrary positive integers.
Considering the contents described in (i) and (ii) above together, the following solution expressions for the search steps Δm, Δk and Δn in the M, K and N dimensions are obtained:
In the row-major case with neither A nor B transposed:
Δm = P_1 × P_0
Δk = scm(P_1 × P_0 × e_K, L)
Δn = scm(P_1 × P_0 × e_N, L)
In the row-major case with A not transposed and B transposed:
Δm = P_1 × P_0
Δk = scm(P_1 × P_0 × e_K, L)
Δn = P_1 × P_0 × e_N
In the column-major case with neither A nor B transposed:
Δm = scm(P_1 × P_0, L)
Δk = scm(P_1 × P_0 × e_K, L)
Δn = P_1 × P_0 × e_N
In the column-major case with A not transposed and B transposed:
Δm = scm(P_1 × P_0, L)
Δk = P_1 × P_0 × e_K
Δn = scm(P_1 × P_0 × e_N, L)
where scm(a, b) denotes the least common multiple of a and b.
Based on the above description, those skilled in the art can understand that, in order to determine the search steps, the matrix information of the present disclosure includes the number of main compute units participating in the matrix multiplication operation (which can be used, for example, to determine the value of "P_1" above), the number of compute subunits within each main compute unit (which can be used, for example, to determine the value of "P_0" above), and the data size at which loading from the off-chip system (e.g., the "DDR") achieves the highest bandwidth utilization (e.g., the aforementioned "L" used to determine the search steps). On this basis, in determining the search steps, the method of the present disclosure can include determining the search steps at least according to the number of main compute units, the number of compute subunits and the data size.
Further, as described above, determining the search steps also requires considering the storage format of the matrices and whether they are transposed. Therefore, the matrix information of the present disclosure can further include the storage format of the first and second matrices in the off-chip system and transposition information on whether they have undergone a transpose operation, where the storage format includes storing in row-major or column-major order as described above. On this basis, in determining the search steps, the method of the present disclosure further includes: determining the search steps, i.e., the aforementioned search steps Δm, Δk and Δn associated with the M, K and N dimensions respectively, according to the storage format and the transposition information of the first and second matrices.
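A sketch of the search-step computation for the two row-major cases listed above follows (Python 3.9+ for math.lcm; the function name and the fallback for the column-major cases are placeholders of this sketch, not part of the present disclosure):

from math import lcm

def search_steps(P1, P0, eK, eN, L, row_major=True, b_transposed=False):
    """Return (dm, dk, dn) for the row-major cases described above."""
    if row_major and not b_transposed:
        return (P1 * P0, lcm(P1 * P0 * eK, L), lcm(P1 * P0 * eN, L))
    if row_major and b_transposed:
        return (P1 * P0, lcm(P1 * P0 * eK, L), P1 * P0 * eN)
    raise NotImplementedError("the column-major cases follow the same pattern")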
After the above search steps are obtained, the method of the present disclosure can employ a suitable search algorithm to find the optimal split coefficients M_b, K_b and N_b. The search algorithm can include, but is not limited to, various optimization algorithms such as global search, neighborhood search and genetic algorithms.
For illustrative purposes only, the final split coefficients of the matrix blocks can be obtained by a global search ("Global Search") algorithm of the following form:
Global Search: traverse all candidates (M_b, K_b, N_b) over the search space at the search steps (Δm, Δk, Δn); for each candidate belonging to both U1 and U2, evaluate the cost function; and keep the candidate with the smallest cost function value,
where U1 is the set satisfying the constraint of the L1 cache (i.e., the second search subspace in the context of the present disclosure, as shown in expression (8)) and U2 is the set satisfying the constraint of the L2 cache (i.e., the first search subspace in the context of the present disclosure, as shown in expression (6)).
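A minimal Python rendering of this global search, assuming the cost function and the constraint predicates sketched earlier (all names are illustrative and not part of the present disclosure):

import math

def global_search(M, K, N, dm, dk, dn, in_U1, in_U2, cost):
    """Exhaustive search over the split coefficients.

    in_U1 / in_U2 are predicates for the L1 / L2 constraint sets and
    cost(Mb, Kb, Nb) evaluates the cost function for one candidate.
    """
    best, best_cost = None, math.inf
    for Mb in range(dm, M + 1, dm):
        for Kb in range(dk, K + 1, dk):
            for Nb in range(dn, N + 1, dn):
                if in_U1(Mb, Kb, Nb) and in_U2(Mb, Kb, Nb):
                    c = cost(Mb, Kb, Nb)
                    if c < best_cost:
                        best, best_cost = (Mb, Kb, Nb), c
    return best, best_cost

Because the loops advance by the search steps Δm, Δk and Δn, every candidate examined automatically satisfies the alignment and leading-dimension conditions derived above, so the predicates only have to enforce the L1 and L2 capacity limits.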
The method of the present disclosure for optimizing a matrix multiplication operation of a system-on-chip has been described above in conjunction with Figs. 1-5. By using the method of the present disclosure, the optimal split coefficients for splitting the matrices can be determined. Thus, when the matrices are split with the optimal split coefficients for matrix multiplication, the resulting cost in data transfer (e.g., I/O overhead) is minimal. On this basis, the hardware platform performing the matrix multiplication will carry out the computation in a more efficient and less costly manner.
Figs. 6a and 6b are schematic diagrams respectively illustrating the principle of matrix block splitting according to an embodiment of the present disclosure. As shown in the figures, and as can be seen from the foregoing, with the split coefficients determined by the optimization algorithm of the present disclosure, one of the matrix blocks obtained by splitting the first matrix of the present disclosure, denoted here as A_block (as shown in the figure), has size (M_b*K_b). Similarly, one of the matrix blocks obtained by splitting the second matrix, denoted B_block (as shown in the figure), has size (K_b*N_b), so that one of the matrix blocks obtained by splitting the result matrix, denoted C_block (as shown in the figure), has size (M_b*N_b). According to the Cannon algorithm, taking one sub-block of the result, e.g. C_sub11, as an example, it equals A_sub11 × B_sub11 + A_sub12 × B_sub21, where A_sub11 and A_sub12 are the two matrix sub-blocks obtained by splitting the matrix block A_block as described above, and B_sub11 and B_sub21 are the two matrix sub-blocks obtained by splitting the matrix block B_block.
As mentioned above, the above matrix sub-blocks can be split even further, so that the matrix multiplication of two matrix sub-blocks (such as A_sub11 × B_sub11) is converted into matrix multiplication operations on four atomic matrices. For example, as shown in Fig. 6b, after further splitting, each matrix sub-block (e.g., A_sub11) can in turn be split into four atomic matrices, e.g., A_atom11, A_atom12, A_atom21 and A_atom22. The same splitting also applies to B_sub11, A_sub12 and B_sub21, which is not repeated here.
Fig. 7 is a structural block diagram of a system-on-chip according to an embodiment of the present disclosure. As shown in Fig. 7, the system-on-chip includes multiple main compute units, i.e., main compute units 1-4 shown in the figure. The L2 cache, further shown, is shared by the aforementioned multiple main compute units. As previously described in conjunction with Fig. 4, the L2 cache is provided with buffers for loading matrix data from the off-chip system (e.g., the DDR schematically shown in Fig. 7), buffers for passing data between adjacent main compute units, and buffers for the matrix multiplication.
For illustrative purposes only, taking the Cannon algorithm for matrix multiplication as an example, the figure shows the matrix sub-blocks loaded at each main compute unit; for example, main compute unit 1 loads sub-blocks A_sub11 and B_sub11 from the DDR via the L2 cache, main compute unit 2 loads sub-blocks A_sub12 and B_sub22 from the DDR via the L2 cache, and so on, in order to perform the first matrix multiplication, e.g., A_sub11 × B_sub11. During the matrix multiplication, a main compute unit can also receive matrix sub-blocks from adjacent main compute units to further perform its matrix multiplications, thereby obtaining its corresponding matrix sub-block as an intermediate result (e.g., the aforementioned A_sub11 × B_sub11). Taking main compute unit 1 as an example, it can receive A_sub12 from main compute unit 2 and B_sub21 from main compute unit 4, in order to perform the second matrix multiplication according to the Cannon algorithm, e.g., A_sub12 × B_sub21. Similarly, taking main compute unit 3 as an example, it can receive B_sub22 from main compute unit 2 and A_sub22 from main compute unit 4, in order to perform the second matrix multiplication according to the Cannon algorithm. After the aforementioned two matrix multiplications, each main compute unit obtains its corresponding result matrix sub-block by adding the two intermediate results. For example, main compute unit 1 computes the result matrix sub-block C_sub11 = A_sub11 × B_sub11 + A_sub12 × B_sub21, main compute unit 2 computes the result matrix sub-block C_sub12, and so on.
Fig. 8 is a schematic diagram illustrating a compute subunit performing matrix multiplication according to an embodiment of the present disclosure. As mentioned above, in a hardware architecture supporting the optimization scheme of the present disclosure, each main compute unit in Fig. 7 can include multiple compute subunits that perform matrix multiplications in parallel, e.g., the four compute subunits shown in Fig. 8, i.e., compute subunits 1-4. Further, each compute subunit can obtain from the L1 cache the atomic matrices required for the matrix multiplication, i.e., the smallest matrix units of the matrix multiplication supported by a compute subunit.
As described above in conjunction with Fig. 7, each main compute unit must complete two matrix multiplications to obtain its corresponding result matrix sub-block, where each matrix multiplication corresponds to the first and second rounds of computation completed by the four compute subunits shown in Fig. 8. As shown in Fig. 8, in the first round of computation, compute subunit 1 obtains atomic matrices, e.g., A_atom11 and B_atom11, from the L1 cache to perform a matrix multiplication. Next, in the second round of computation, compute subunit 1 obtains, e.g., A_atom12 and B_atom21, from the L1 cache to perform a matrix multiplication. Finally, the intermediate results of the two rounds, e.g., A_atom11 × B_atom11 and A_atom12 × B_atom21, are added to obtain the intermediate result of compute subunit 1's matrix multiplication of atomic matrices. The same applies to compute subunits 2-4, which is not repeated here.
The optimization scheme of the present disclosure and its application in conjunction with the hardware architecture have been described in detail above with reference to the drawings; the algorithm selection scheme of the present disclosure is discussed next. Here, the algorithm selection scheme refers to selecting the optimal algorithm, from multiple matrix multiplication algorithms applicable to the matrix multiplication operation, to perform the matrix multiplication. In one implementation, different search spaces can be set up by performing different splitting schemes on the on-chip system, thereby ultimately obtaining different matrix multiplication algorithms. For example, when performing the splitting at the compute-subunit level, one may choose to split only the matrix sub-blocks of the first matrix (e.g., splitting along the M row direction) while leaving the matrix sub-blocks of the second matrix unsplit, thereby obtaining a corresponding search space and ultimately forming a new matrix multiplication algorithm. Similarly, the matrix sub-blocks of the second matrix can be split (e.g., along the N column direction) while leaving the matrix sub-blocks of the first matrix unsplit, thereby forming another new matrix multiplication algorithm. It can be seen that neither of the two matrix multiplication algorithms obtained above performs compute-subunit-level splitting along the K direction (the column direction for the first matrix and the row direction for the second matrix).
When there exist multiple candidate matrix multiplication algorithms of the above kind for implementing the matrix multiplication, since the number of such algorithms is finite, they form a finite algorithm space F = {f_0, f_1, f_2, ..., f_n}. A global optimization objective can then be set within this algorithm space as the following expression:

min_{f_i ∈ F} cost_i = min_{f_i ∈ F} ( ⌈N/N_bi⌉ × A_size + ⌈M/M_bi⌉ × B_size )    (16)

where M, K, N, M_b, K_b and N_b have the same meanings as the corresponding items in the foregoing expressions (such as expression (2)).
Based on the above scenario, how to select the optimal matrix multiplication algorithm is described in detail below in conjunction with Fig. 9.
Fig. 9 is a flowchart illustrating a method 900 for selecting an optimal matrix multiplication algorithm according to an embodiment of the present disclosure. As shown in Fig. 9, at step S902, a cost function can be determined; the way the cost function is determined can be the way described above in conjunction with Fig. 2, which is not repeated here. As an example, the following cost function can be determined:

cost = ⌈N/N_b⌉ × A_size / γ(lda_b) + ⌈M/M_b⌉ × B_size / γ(ldb_b)    (17)

Similar to expression (16), the symbols in expression (17) have the same meanings as the symbols in the foregoing expressions (such as expression (5)).
Next, at step S904, the search space of each of the multiple matrix multiplication algorithms (i.e., the multiple "candidate algorithms" above) is determined, and at step S906, the search steps of that search space are determined. The search space and search steps can be determined as described above, which is not repeated here. Next, at step S908, a search algorithm (e.g., the aforementioned global search, neighborhood search or genetic algorithm) is used to search with the determined search steps, so that at step S910 the split coefficients corresponding to each matrix multiplication algorithm are determined (e.g., the split coefficients M_bi, K_bi and N_bi for the i-th algorithm). Next, the cost function value of each matrix multiplication algorithm can be computed at step S912, and the matrix multiplication algorithm with the smallest cost function value is determined at step S914. Thus, at step S916, the matrix multiplication algorithm with the smallest cost function value is selected as the optimal matrix multiplication algorithm, and its corresponding split coefficients are used to perform the multi-level splitting of the large matrices.
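Steps S904-S916 can be summarized by the following sketch, where each candidate algorithm is represented by a callable that runs its own search and returns its best split coefficients and cost value; the interface is hypothetical and for illustration only:

def select_algorithm(candidates):
    """Pick the matrix multiplication algorithm with the smallest cost value.

    `candidates` maps an algorithm name to a callable returning
    ((Mb, Kb, Nb), cost_value) from that algorithm's own search space.
    """
    best_name, best_split, best_cost = None, None, float("inf")
    for name, search in candidates.items():
        split, c = search()
        if c < best_cost:
            best_name, best_split, best_cost = name, split, c
    return best_name, best_split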
Through the above algorithm selection scheme of the present disclosure, the optimal algorithm can be selected from multiple algorithms for the matrix multiplication operation. The selected algorithm can multiply large matrices at the smallest computational cost, thereby improving the efficiency of the matrix multiplication operation and reducing the computing cost. Further, when the aforementioned optimal algorithm is used on the on-chip system to perform the matrix multiplication, the resource usage of the on-chip system is also maximized, thereby fully exploiting its computing power.
Fig. 10 is a structural diagram of a combined processing apparatus 1000 according to an embodiment of the present disclosure. As shown in Fig. 10, the combined processing apparatus 1000 includes a computing processing apparatus 1002, an interface apparatus 1004, an other processing apparatus 1006 and a storage apparatus 1008. Depending on the application scenario, the computing processing apparatus can include one or more integrated circuit apparatuses 1010, and the integrated circuit apparatus can include the system-on-chip described in the context of the present disclosure, which is used to perform matrix multiplication between matrices. In one implementation scenario, the matrices can be large or very large matrices. Further, through the optimization scheme discussed in the context of the present disclosure, the aforementioned large or very large matrices (such as the aforementioned first and second matrices of the present disclosure) can be split according to the split coefficients, thereby obtaining matrix blocks suitable for the system-on-chip to perform the matrix multiplication operation.
In different embodiments, the computing processing apparatus of the present disclosure can be configured to perform user-specified operations, such as the matrix multiplication operations of the present disclosure. In exemplary applications, the computing processing apparatus can be implemented as (or include) a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing apparatuses included in the computing processing apparatus can be implemented as an artificial intelligence processor core or part of the hardware structure of an artificial intelligence processor core. When multiple computing apparatuses are implemented as artificial intelligence processor cores or parts of the hardware structure of artificial intelligence processor cores, the computing processing apparatus of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
In exemplary operation, the computing processing apparatus of the present disclosure can interact with the other processing apparatus through the interface apparatus to jointly complete user-specified operations. Depending on the implementation, the other processing apparatus of the present disclosure can include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU) and an artificial intelligence processor. These processors can include, but are not limited to, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, the computing processing apparatus of the present disclosure taken alone can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing apparatus and the other processing apparatus are considered together, the two can be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing apparatus can serve as an interface between the computing processing apparatus of the present disclosure (which can be embodied as an apparatus for operations related to artificial intelligence, e.g., neural network computation) and external data and control, performing basic control including, but not limited to, data transfer and starting and/or stopping the computing apparatus. In other embodiments, the other processing apparatus can also cooperate with the computing processing apparatus to jointly complete computational tasks.
In one or more embodiments, the interface apparatus can be used to transfer data and control instructions between the computing processing apparatus and the other processing apparatus. For example, the computing processing apparatus can obtain input data from the other processing apparatus via the interface apparatus and write it into an on-chip storage apparatus (or memory) of the computing processing apparatus. Further, the computing processing apparatus can obtain control instructions from the other processing apparatus via the interface apparatus and write them into an on-chip control cache of the computing processing apparatus. Alternatively or optionally, the interface apparatus can also read the data in the storage apparatus of the computing processing apparatus and transfer it to the other processing apparatus. Additionally or optionally, the combined processing apparatus of the present disclosure can further include a storage apparatus. As shown in the figure, the storage apparatus is connected to the computing processing apparatus and the other processing apparatus, respectively. In one or more embodiments, the storage apparatus can be used to store data of the computing processing apparatus and/or the other processing apparatus, for example data that cannot be fully stored in the internal or on-chip storage of the computing processing apparatus or the other processing apparatus.
In some embodiments, the present disclosure also discloses a chip (e.g., the chip 1102 shown in Fig. 11). In one implementation, the chip is a system-on-chip (System on Chip, SoC) integrating one or more combined processing apparatuses as shown in Fig. 10, and it can be configured to perform matrix multiplication between matrix blocks. The chip can be connected to other related components through an external interface apparatus (such as the external interface apparatus 1106 shown in Fig. 11). The related components can be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface. In some application scenarios, other processing units (e.g., a video codec) and/or interface modules (e.g., a DRAM interface) can be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the above chip. In some embodiments, the present disclosure also discloses a board card including the above chip package structure. The board card is described in detail below in conjunction with Fig. 11.
Fig. 11 is a schematic structural diagram of a board card 1100 according to an embodiment of the present disclosure. As shown in Fig. 11, the board card includes a storage device 1104 for storing data, which includes one or more storage units 1110. The storage device can be connected to, and transfer data with, the control device 1108 and the chip 1102 described above, for example via a bus. Further, the board card also includes an external interface apparatus 1106 configured for data relay or forwarding between the chip (or the chip in the chip package structure) and an external device 1112 (e.g., a server or a computer). For example, the data to be processed can be transferred from the external device to the chip through the external interface apparatus. For another example, the computation results of the chip can be transmitted back to the external device via the external interface apparatus. Depending on the application scenario, the external interface apparatus can take different interface forms; for example, it can adopt a standard PCIE interface. In one or more embodiments, the control device in the board card of the present disclosure can be configured to regulate the state of the chip. To this end, in one application scenario, the control device can include a micro controller unit (Micro Controller Unit, MCU) for regulating the working state of the chip.
From the above description in conjunction with Figs. 10 and 11, those skilled in the art can understand that the present disclosure also discloses an electronic device or apparatus, which can include one or more of the above board cards, one or more of the above chips and/or one or more of the above combined processing apparatuses. In one implementation scenario, the electronic device or apparatus can be configured to perform the matrix multiplication operations discussed in the context of the present disclosure, and the matrix data participating in the matrix multiplication are the matrix blocks obtained after splitting with the optimal split coefficients of the present disclosure.
Depending on the application scenario, the electronic device or apparatus of the present disclosure can include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashcam, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include aircraft, ships and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs. The electronic device or apparatus of the present disclosure can also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and healthcare.
Further, the electronic device or apparatus of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as the cloud, the edge and terminals. In one or more embodiments, electronic devices or apparatuses with high computing power according to the scheme of the present disclosure can be applied to cloud devices (e.g., cloud servers), while electronic devices or apparatuses with low power consumption can be applied to terminal devices and/or edge devices (e.g., smartphones or webcams). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are mutually compatible, so that, according to the hardware information of the terminal device and/or edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for the sake of brevity, the present disclosure presents some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions. Therefore, based on the disclosure or teaching of the present disclosure, those skilled in the art can understand that some of the steps can be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, i.e., the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, depending on the solution, the description of some embodiments in the present disclosure also has its own emphasis. In view of this, those skilled in the art can understand that, for the parts not described in detail in one embodiment of the present disclosure, reference can also be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching of the present disclosure, those skilled in the art can understand that several embodiments disclosed in the present disclosure can also be implemented in other ways not disclosed herein. For example, as for the units in the aforementioned electronic device or apparatus embodiments, they are divided herein on the basis of logical functions, but there can be other ways of division in actual implementation. For another example, multiple units or components can be combined or integrated into another system, or some features or functions of a unit or component can be selectively disabled. As far as the connection relationships between different units or components are concerned, the connections discussed above in conjunction with the drawings can be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections using interfaces, where the communication interfaces can support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units can be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units can be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure can be integrated into one unit, or each unit can physically exist separately.
In some implementation scenarios, the above integrated units can be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as an independent product, the integrated unit can be stored in a computer-readable memory. On this basis, when the solution of the present disclosure is embodied in the form of a software product (e.g., a computer-readable storage medium), the software product can be stored in a memory and can include several instructions to cause a computer device (e.g., a personal computer, a server or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory can include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, an optical disc, or other media that can store program code.
In other implementation scenarios, the above integrated units can also be implemented in the form of hardware, i.e., specific hardware circuits, which can include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuits can include, but is not limited to, physical devices, and the physical devices can include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., the computing apparatus or other processing apparatuses) can be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs and ASICs. Further, the aforementioned storage unit or storage apparatus can be any suitable storage medium (including magnetic storage media or magneto-optical storage media, etc.), which can be, for example, a resistive random access memory (Resistive Random Access Memory, RRAM), a dynamic random access memory (Dynamic Random Access Memory, DRAM), a static random access memory (Static Random Access Memory, SRAM), an enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), a high bandwidth memory (High Bandwidth Memory, HBM), a hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
It should also be understood that any module, unit, component, server, computer, terminal or device that executes instructions of the examples of the present disclosure can include or otherwise have access to computer-readable media, such as storage media, computer storage media or data storage devices (removable and/or non-removable) such as magnetic disks, optical discs or magnetic tapes. Computer storage media can include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules or other data.
It should be understood that the terms "first", "second", "third" and "fourth" in the claims, description and drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. The terms "include" and "comprise" used in the description and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, components and/or collections thereof.
It should also be understood that the terms used in the description of the present disclosure are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the description and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations. As used in the description and claims, the term "if" can be interpreted, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if the [described condition or event] is detected" can be interpreted, depending on the context, as meaning "once it is determined" or "in response to determining" or "once the [described condition or event] is detected" or "in response to detecting the [described condition or event]".
Although the embodiments of the present invention are as above, the described contents are only examples adopted to facilitate understanding of the present invention and are not intended to limit its scope and application scenarios. Any person skilled in the technical field of the present invention can make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the patent protection scope of the present invention shall still be subject to the scope defined by the appended claims.

Claims (15)

  1. A method for optimizing a matrix multiplication operation of a system-on-chip, the method being implemented by one or more processors and comprising:
    receiving matrix information of a first matrix and a second matrix to be split in order to perform the matrix multiplication operation, wherein the first matrix has M rows × K columns and the second matrix has K rows × N columns; and
    determining, by minimizing a cost function, split coefficients for splitting the first matrix and the second matrix, the split coefficients comprising a number of rows M_b and a number of columns K_b of matrix blocks obtained by splitting the first matrix and a number of rows K_b and a number of columns N_b of matrix blocks obtained by splitting the second matrix,
    wherein the cost function is used to determine the cost that transferring matrix data between the system-on-chip and an off-chip system imposes on performing the matrix multiplication operation on the system-on-chip,
    wherein the cost function is based at least on a data size of the first matrix, a data size of the second matrix, the number of rows M of the first matrix, the number of columns N of the second matrix, and the split coefficients.
  2. The method of claim 1, wherein the cost function is further based on a bandwidth utilization coefficient, wherein the bandwidth utilization coefficient equals the ratio between the effective bandwidth when loading a matrix block from the off-chip system by a predetermined data length and the total bandwidth between the system-on-chip and the off-chip system.
  3. The method of claim 1 or 2, wherein in determining the split coefficients by minimizing the cost function, the method comprises establishing a search space for minimizing the cost function, so as to determine the split coefficients using the search space.
  4. The method of claim 3, wherein establishing the search space for minimizing the cost function comprises:
    partitioning a cache of the system-on-chip; and
    establishing the search space according to the partitioning result, wherein the cache is arranged to store split matrix blocks and matrix multiplication results obtained by performing the matrix multiplication operation on the split matrix blocks.
  5. The method of claim 4, wherein the system-on-chip comprises multiple levels of cache, and the method comprises:
    establishing a search subspace associated with each level of cache according to a predetermined matrix multiplication algorithm used to perform the matrix multiplication operation.
  6. The method of claim 5, wherein the multiple levels of cache comprise a level-one cache and a level-two cache, and the search space comprises a first search subspace and a second search subspace, the method comprising:
    establishing the first search subspace according to the arrangement of multiple first buffers in the level-one cache, wherein the multiple first buffers are used to store matrix sub-blocks obtained by splitting the matrix blocks and intermediate results obtained by performing the matrix multiplication operation on the matrix sub-blocks; and
    establishing the second search subspace according to the arrangement of multiple second buffers in the level-two cache, wherein the multiple second buffers are used to store atomic matrices obtained by splitting the matrix sub-blocks and intermediate results obtained by performing the matrix multiplication operation on the atomic matrices.
  7. The method of claim 6, wherein determining the split coefficients by minimizing the cost function comprises:
    determining search steps for searching the search space, wherein the search steps comprise search steps Δm, Δk and Δn associated with the M, K and N dimensions, respectively; and
    searching the search space with the search steps using a search algorithm to determine the M_b, K_b and N_b that minimize the cost function.
  8. The method of claim 7, wherein the matrix information comprises the number of main compute units participating in the matrix multiplication operation, the number of processing subunits within each of the main compute units, and the data size at which loading from the off-chip system achieves the highest bandwidth utilization, and wherein in determining the search steps, the method comprises:
    determining the search steps at least according to the number of main compute units, the number of processing subunits and the data size.
  9. The method of claim 8, wherein the matrix information further comprises the storage format of the first matrix and the second matrix in the off-chip system and transposition information on whether they have undergone a transpose operation, wherein the storage format comprises storing in row-major order or column-major order, and wherein in determining the search steps, the method further comprises:
    determining the search steps according to the storage format and the transposition information of the first matrix and the second matrix.
  10. The method of claim 5, further comprising:
    obtaining multiple search spaces using multiple candidate matrix multiplication algorithms;
    obtaining a cost function value associated with each of the candidate matrix multiplication algorithms according to the multiple search spaces and the cost function; and
    selecting, according to a comparison of the multiple cost function values, the candidate algorithm with the smallest cost function value from the multiple candidate matrix multiplication algorithms as the predetermined matrix multiplication algorithm.
  11. A device for optimizing a matrix multiplication operation of a system-on-chip, comprising:
    a processor; and
    a memory storing program instructions for optimizing the matrix multiplication operation of the system-on-chip which, when executed by the processor, cause the device to perform the method of any one of claims 1-10.
  12. A computer-readable storage medium having stored thereon program instructions for optimizing a matrix multiplication operation of a system-on-chip which, when executed by a processor, implement the method of any one of claims 1-10.
  13. A system-on-chip for performing a matrix multiplication operation, comprising:
    multiple main processing units, wherein each main processing unit comprises multiple processing subunits, and each processing subunit is used to perform a corresponding matrix multiplication operation; and
    multiple caches for caching matrix data on which the matrix multiplication operation is to be performed and results associated with the matrix multiplication operation,
    wherein the system-on-chip is configured to perform matrix multiplication between matrix blocks, and the matrix blocks are obtained by splitting matrices according to the split coefficients of the method of any one of claims 1-10.
  14. An integrated circuit apparatus comprising the system-on-chip of claim 13.
  15. A board card comprising the integrated circuit apparatus of claim 14.
PCT/CN2022/086815 2021-04-16 2022-04-14 Method for optimizing matrix multiplication operations of a system-on-chip, and related products WO2022218374A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22787596.0A EP4325373A1 (en) 2021-04-16 2022-04-14 Method for optimizing matrix multiplication operation on system on chip, and related product
US18/374,817 US20240028666A1 (en) 2021-04-16 2023-09-29 Method for optimizing matrix multiplication operation on system on chip, and related product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110414133.3 2021-04-16
CN202110414133.3A CN115221101B (zh) 2021-04-16 2021-04-16 Method for optimizing matrix multiplication operations of a system-on-chip, and related products

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/374,817 Continuation US20240028666A1 (en) 2021-04-16 2023-09-29 Method for optimizing matrix multiplication operation on system on chip, and related product

Publications (1)

Publication Number Publication Date
WO2022218374A1 true WO2022218374A1 (zh) 2022-10-20

Family

ID=83605184

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/086815 2022-04-14 2022-04-14 Method for optimizing matrix multiplication operations of a system-on-chip, and related products

Country Status (4)

Country Link
US (1) US20240028666A1 (zh)
EP (1) EP4325373A1 (zh)
CN (1) CN115221101B (zh)
WO (1) WO2022218374A1 (zh)


Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0425296A3 (en) * 1989-10-27 1992-10-14 Texas Instruments Incorporated Speedup for solution of systems of linear equations
US7792895B1 (en) * 2006-06-16 2010-09-07 Nvidia Corporation Efficient matrix multiplication on a parallel processing device
US20170116156A1 (en) * 2015-10-22 2017-04-27 International Business Machines Corporation Parallelizing matrix factorization across hardware accelerators
US10032247B2 (en) * 2016-06-22 2018-07-24 Palo Alto Research Center Incorporated System and method for speeding up general matrix-vector multiplication on GPU
CN108090028B (zh) * 2017-12-15 2021-03-23 中科寒武纪科技股份有限公司 Computing method and related product
KR101990735B1 (ko) * 2018-03-30 2019-06-18 서울대학교산학협력단 Method and apparatus for large-scale graph mining using matrix-vector multiplication based on prior graph partitioning
JP7132043B2 (ja) * 2018-09-10 2022-09-06 東京計器株式会社 Reconfigurable processor
CN111028136B (zh) * 2019-12-24 2023-04-07 上海寒武纪信息科技有限公司 Method and device for processing a two-dimensional complex matrix by an artificial intelligence processor
CN111125628A (zh) * 2019-12-24 2020-05-08 上海寒武纪信息科技有限公司 Method and device for processing a two-dimensional data matrix by an artificial intelligence processor
CN111523642B (zh) * 2020-04-10 2023-03-28 星宸科技股份有限公司 Data reuse method for convolution operations, operation method and apparatus, and chip
CN112416433B (zh) * 2020-11-24 2023-01-17 中科寒武纪科技股份有限公司 Data processing apparatus, data processing method and related products

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375721A (zh) * 2010-08-23 2012-03-14 联想(北京)有限公司 Matrix multiplication operation method, graphics processor and electronic device
US20120316886A1 (en) * 2011-06-08 2012-12-13 Ramin Pishehvar Sparse coding using object extraction
CN111191699A (zh) * 2019-12-22 2020-05-22 中国人民解放军陆军工程大学 Multi-view clustering method based on non-negative matrix factorization and partition-adaptive fusion
CN112541159A (zh) * 2020-09-30 2021-03-23 华为技术有限公司 Model training method and related device

Also Published As

Publication number Publication date
CN115221101A (zh) 2022-10-21
EP4325373A1 (en) 2024-02-21
CN115221101B (zh) 2023-12-19
US20240028666A1 (en) 2024-01-25

Similar Documents

Publication Publication Date Title
CN110298443B (zh) Neural network operation apparatus and method
US10936941B2 (en) Efficient data access control device for neural network hardware acceleration system
TWI735545B (zh) Model training method and apparatus
CN109522052B (zh) Computing apparatus and board card
WO2022218373A1 (zh) Method for optimizing convolution operations of a system-on-chip, and related products
WO2023093623A1 (zh) Computation graph optimization method, data processing method and related products
CN111898698B (zh) Object processing method and apparatus, storage medium and electronic device
CN112799726A (zh) Data processing apparatus and method, and related products
TWI775210B (zh) Data partitioning method for convolution operations, and processor
WO2021036362A1 (zh) Method and apparatus for processing data, and related products
CN111124995A (zh) Method and device for processing a one-dimensional complex array by an artificial intelligence processor
WO2022012233A1 (zh) Quantization calibration method, computing apparatus and computer-readable storage medium
WO2021185262A1 (zh) Computing apparatus and method, board card and computer-readable storage medium
WO2022218374A1 (zh) Method for optimizing matrix multiplication operations of a system-on-chip, and related products
WO2021082725A1 (zh) Winograd convolution operation method and related products
US10915470B2 (en) Memory system
TW202145078A (zh) Computing method with dynamic minimum batch size, and computing system and computer-readable storage medium for performing the method
TW202008172A (zh) Storage system
CN114003198B (zh) Inner product processing component, arbitrary-precision computing device and method, and readable storage medium
WO2022143799A1 (zh) Integrated circuit apparatus, computing device, system and method for matrix multiplication operations
WO2021082747A1 (zh) Computing apparatus and related products
CN112817898A (zh) Data transmission method, processor, chip and electronic device
WO2022001438A1 (zh) Computing apparatus, integrated circuit chip, board card, device and computing method
WO2022199680A1 (zh) Data processing apparatus and method, and related products
US20220222041A1 (en) Method and apparatus for processing data, and related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22787596

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022787596

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022787596

Country of ref document: EP

Effective date: 20231116