US20240028666A1 - Method for optimizing matrix multiplication operation on system on chip, and related product
- Publication number: US20240028666A1
- Authority: US (United States)
- Prior art keywords: matrix, search, chip system, splitting, sub
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F15/7807: System on chip, i.e. computer system on a single chip; system in package, i.e. computer system on one or more chips in a single package
- G06F15/781: On-chip cache; off-chip memory
- G06F7/523: Multiplying only
- G06N3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming
- G06N5/01: Dynamic search techniques; heuristics; dynamic trees; branch-and-bound
Definitions
- the present disclosure generally relates to the field of data computing. More specifically, the present disclosure relates to a method for optimizing matrix multiplication of an on-chip system, a device, and a computer-readable storage medium.
- Matrix multiplication is a very common data operation in the field of scientific computing and data processing.
- in fields such as deep neural network (DNN), recurrent neural network (RNN), and natural language processing (NLP) applications, a large amount of data computing is usually involved, including matrix multiplication on various types of data.
- to perform such computing, a processor such as a central processing unit (CPU) or a graphics processing unit (GPU) is usually used, which exchanges data with an external memory over an input/output (I/O) bus.
- when this is done, not only will the limited bandwidth of the I/O bus become the bottleneck of system performance, but the large amount of I/O access between the processor and the external memory will also adversely affect computing and power consumption. Therefore, optimizing matrix access is a very important means of improving the performance of general matrix multiplication.
- the present disclosure provides a solution that optimizes matrix multiplication of an on-chip system. Specifically, the present disclosure provides a method for determining an optimal matrix splitting in matrix multiplication. By using an optimal splitting method to split a matrix, the matrix multiplication disclosed in the present disclosure significantly reduces the amount of data transmission with an external memory, thereby alleviating the I/O bottleneck caused by the limited bandwidth of the bus and in turn improving the operation efficiency of the matrix multiplication. In view of this, the present disclosure provides the foregoing solution in the following aspects.
- a first aspect of the present disclosure discloses a method for performing a matrix multiplication on an on-chip system, where the method is implemented by at least one processor and includes: receiving matrix information of a first matrix and a second matrix, where the first matrix is M rows×K columns and the second matrix is K rows×N columns; and determining splitting coefficients for splitting the first matrix and the second matrix by optimizing a cost function, where the splitting coefficients include a row count M b and a column count K b of matrix blocks obtained after splitting the first matrix and a row count K b and a column count N b of matrix blocks obtained after splitting the second matrix, and where the cost function is indicative of the cost of transferring matrix data between the on-chip system and an off-chip system to perform the matrix multiplication of the first matrix and the second matrix on the on-chip system.
- the cost function is at least based on a data size of the first matrix, a data size of the second matrix, a row count M of the first matrix, a column count N of the second matrix, and the splitting coefficients.
- a second aspect of the present disclosure discloses a device configured to perform a matrix multiplication of an on-chip system, including: at least one processor; and a memory, on which a program instruction for performing the matrix multiplication on an on-chip system is stored, where when the program instruction is performed by the processor, the device performs the above method.
- a third aspect of the present disclosure discloses a computer-readable storage medium, on which a program instruction for performing the matrix multiplication of an on-chip system is stored, where when the program instruction is performed by a processor, the above method is performed.
- a fourth aspect of the present disclosure discloses an on-chip system for performing a matrix multiplication, including: a plurality of master computing units, where each master computing unit includes a plurality of computing sub-units, where each computing sub-unit is configured to perform a corresponding matrix multiplication; and a plurality of caches, which are configured to cache matrix data on which a matrix multiplication is to be performed and results associated with the matrix multiplication, where the on-chip system is configured to perform a matrix multiplication between matrix blocks, and the matrix blocks are obtained by splitting a matrix according to the above method.
- a fifth aspect of the present disclosure discloses an integrated circuit apparatus, including the above on-chip system.
- a sixth aspect of the present disclosure discloses a board card, including the above integrated circuit apparatus.
- an optimal splitting method for a matrix participating in matrix multiplication may be determined, thereby significantly optimizing matrix multiplication.
- the solution of the present disclosure selects optimal splitting coefficients for splitting matrices. Therefore, through matrix multiplication performed based on the optimal splitting coefficients, the solution of the present disclosure may make full use of on-chip resources of the on-chip system and reduce I/O data interaction with an external memory of the off-chip system, thus achieving efficient parallel execution of data transfer and the matrix multiplication.
- the solution of the present disclosure also simplifies the complexity of the matrix multiplication and supports the matrix multiplication of super-large matrices.
- the solution of the present disclosure may also select an optimal matrix multiplication algorithm from a plurality of candidate matrix multiplication algorithms to realize the efficient execution of the matrix multiplication.
- FIG. 1 is a schematic diagram of matrix splitting according to an embodiment of the present disclosure.
- FIG. 2 is a flowchart of a method for optimizing matrix multiplication of an on-chip system according to an embodiment of the present disclosure.
- FIG. 3 is an architecture diagram for performing matrix access according to an embodiment of the present disclosure.
- FIG. 4 is a schematic architecture diagram of L2 caching area shown in FIG. 3 according to an embodiment of the present disclosure.
- FIG. 5 is a schematic architecture diagram of L1 caching area shown in FIG. 3 according to an embodiment of the present disclosure.
- FIG. 6 a and FIG. 6 b are schematic diagrams of matrix block splitting principle according to embodiments of the present disclosure.
- FIG. 7 is a structural diagram of an on-chip system that performs a matrix multiplication according to an embodiment of the present disclosure.
- FIG. 8 is a schematic diagram where a computing sub-unit performs a matrix multiplication according to an embodiment of the present disclosure.
- FIG. 9 is a flowchart of a method for selecting an optimal matrix multiplication algorithm according to an embodiment of the present disclosure.
- FIG. 10 shows a structural diagram of a combined processing apparatus according to an embodiment of the present disclosure.
- FIG. 11 is a schematic structural diagram of a board card according to an embodiment of the present disclosure.
- the inventor of the present disclosure finds that when two matrices are split in an arbitrary form to perform a matrix multiplication, this splitting action does not significantly change the total computing amount of multiplication and addition in the matrix multiplication. However, it significantly changes the amount of I/O between an on-chip system and an off-chip system. Therefore, optimizing the amount of I/O between the on-chip system and the off-chip system is key to improving matrix multiplication performance. In view of this, in order to improve the performance of data access between the on-chip system and the off-chip system in the matrix multiplication, increase the operation efficiency of the matrix multiplication, and reduce operation cost, the present disclosure proposes a solution for optimizing matrix multiplication, which involves determining splitting coefficients for splitting a large matrix.
- the large matrices may be considered to be split into blocks, and each block (“matrix block” in the context of the present disclosure) is regarded as an element of the matrix, and the matrix multiplication is performed on the basis of that element.
- general matrix multiplication may be converted into block matrix multiplication.
- multiplication between large matrices may be made clearer and more explicit, thus greatly simplifying computing.
- block matrix multiplication is also an important means to solve the general matrix multiplication problem of the on-chip system.
- the on-chip system may multiply only two split matrix blocks at a time, so that the matrix multiplication may be adapted to the limited storage resources and computing resources.
- the following takes FIG. 1 as an example to illustrate the above matrix splitting.
- FIG. 1 is a schematic diagram of splitting a matrix according to an embodiment of the present disclosure, where an upper part shows matrices before splitting, and a lower part shows matrices after splitting, as shown by arrows in the figure.
- FIG. 1 shows that matrix multiplication is performed on a matrix A (“first matrix” of the present disclosure) and a matrix B (“second matrix” of the present disclosure), thus obtaining a matrix C as a result matrix.
- the matrix A and the matrix B shown in the upper part of FIG. 1 may be split, where a grid box in the matrix A and the matrix B represents an exemplary matrix block.
- a matrix block A 11 (composed of elements including a 11 , a 12 , a 21 , a 22 , and the like) of the matrix A and a matrix block B 11 (composed of elements including b 11 , b 12 , b 21 , b 22 , and the like) of the matrix B may be obtained as shown in the lower part of the figure, where each matrix block acts as a new element in the split matrix.
- each matrix block of the matrix A, the matrix B, and the matrix C after splitting is represented as: A block (whose size is M b rows*K b columns), B block (whose size is K b rows*N b columns), and C block (whose size is M b rows*N b columns).
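- To make the splitting concrete, the following is a minimal NumPy sketch of the block multiplication described above (an illustration only, not code from the patent), assuming for simplicity that the matrix dimensions are divisible by the block sizes:

```python
import numpy as np

def blocked_matmul(A, B, Mb, Kb, Nb):
    """Multiply A (M x K) by B (K x N) block by block.

    Mb, Kb, and Nb play the role of the splitting coefficients;
    dimensions are assumed divisible by the block sizes.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, Mb):          # row strips of C, Mb rows each
        for j in range(0, N, Nb):      # column strips of C, Nb columns each
            for k in range(0, K, Kb):  # accumulate over the K dimension
                # one product of an (Mb x Kb) block of A with a (Kb x Nb)
                # block of B, i.e. one A block x B block of the text
                C[i:i+Mb, j:j+Nb] += A[i:i+Mb, k:k+Kb] @ B[k:k+Kb, j:j+Nb]
    return C

# quick check against the unblocked product
A = np.arange(48, dtype=np.float64).reshape(8, 6)
B = np.arange(24, dtype=np.float64).reshape(6, 4)
assert np.allclose(blocked_matmul(A, B, Mb=4, Kb=3, Nb=2), A @ B)
```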
- the present disclosure proposes a solution for determining an optimal matrix block to achieve optimal splitting (or blocking) of the matrix.
- M b , K b , and N b may be determined.
- the first matrix and the second matrix are split by using the determined M b , K b , and N b (which are splitting coefficients in the context of the present disclosure).
- the solution of the present disclosure may simplify the matrix multiplication, minimize the I/O bottleneck caused by the bandwidth limitation of the on-chip system, and then improve the operation efficiency of the matrix multiplication.
- FIG. 2 is a flowchart of a method 200 for optimizing matrix multiplication of an on-chip system according to an embodiment of the present disclosure.
- the on-chip system is usually a complete system integrated on a single chip.
- This system may generally include various units, such as a system-on-chip control logic unit, a microprocessor/micro-controller central processing unit (CPU) kernel unit, an embedded memory unit, and an interface unit for communicating with an off-chip system.
- the on-chip system described herein may be a system-on-chip that supports matrix multiplication, including a plurality of master computing units for performing a matrix multiplication and a memory for storing matrix data and matrix multiplication results.
- the plurality of master computing units may be connected in turn to form a data transfer loop, and each master computing unit may include a plurality of computing sub-units, thereby achieving matrix splitting at the master computing unit level and second-level matrix splitting at the computing sub-unit level, which together constitute multilevel matrix splitting.
- splitting coefficients disclosed herein are also related to the number of master computing units of the on-chip system and the number of computing sub-units contained in each master computing unit. The exemplary connection and arrangement of the master computing units and the computing sub-units are described in detail later in conjunction with the drawings.
- the method 200 for optimizing the matrix multiplication of the on-chip system of the present disclosure includes: in step S 202 , receiving matrix information of a first matrix (such as the matrix A in FIG. 1 ) and a second matrix (such as the matrix B in FIG. 1 ) that are to be split to perform a matrix multiplication.
- the above matrix information may include a data size and data information of a matrix.
- the matrix information may indicate that the first matrix is a large matrix with M rows*K columns and the second matrix is a large matrix with K rows*N columns.
- the matrix information includes a data size of each element in the first matrix and/or the second matrix (for example, the data size may be in bits or bytes). It may be understood that since splitting is required, M, K, and N may represent relatively large positive integers, such as 256, 512, 1024, or 2048, so that the solution disclosed herein may be applied to splitting and multiplication of large matrices.
- splitting coefficients for splitting the first matrix and the second matrix are determined by minimizing a cost function.
- the above splitting coefficients may include a size of a matrix block (“block”) after splitting, such as a row count M b and a column count K b of matrix blocks obtained after splitting the first matrix and a row count K b and a column count N b of matrix blocks obtained after splitting the second matrix.
- the above cost function is used to determine (or measure) the cost of transferring matrix data between the on-chip system and an off-chip system (as shown in FIG. 3 ) to perform the matrix multiplication on the on-chip system.
- the on-chip system may be configured with on-chip cache memories 304 and 306 for storing matrix blocks.
- the off-chip system may be configured with a global memory 302 , which may transfer various types of data including matrix blocks to the on-chip cache memories through an I/O interface.
- the global memory may be a dynamic random access memory (“DRAM”), such as a double data rate (“DDR”) synchronous dynamic random access memory.
- the present disclosure proposes to construct the expression form of the cost function at least based on a data size of the first matrix, a data size of the second matrix, a row count M of the first matrix, a column count N of the second matrix, and splitting coefficients (M b , K b , and N b to be determined in the solution of the present disclosure).
- the cost function of the present disclosure may be expressed as:
- the cost function of the formula (2) may be further expressed as:
- the cost function may be further expressed as follows (K in the formula (3) is omitted):
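- The formulas (1) through (4) appear only as images in the original publication and are not reproduced in this text. As a hedged reconstruction based solely on the stated dependencies (the data sizes of the two matrices, the row count M, the column count N, and the splitting coefficients), the standard I/O analysis of block matrix multiplication gives a cost of the following form, offered only as an illustration of what such a cost function can look like:

$$\mathrm{cost}(M_b, K_b, N_b) \approx \frac{N}{N_b}\cdot\mathrm{size}(A) + \frac{M}{M_b}\cdot\mathrm{size}(B)$$

- In this assumed form, every block of the first matrix must be loaded once per column strip of the result (N/N b times) and every block of the second matrix once per row strip (M/M b times). Writing size(A) = M·K·w and size(B) = K·N·w for an element width w makes the cost proportional to M·N·K·w·(1/M b + 1/N b ), which is consistent with the remark above that K may be omitted as a common factor.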
- the solution of the present disclosure also proposes to incorporate a bandwidth utilization coefficient into the cost function, where the bandwidth utilization coefficient is equal to a ratio between an equivalent bandwidth when the matrix blocks are loaded from the off-chip system at a predetermined data length and a total bandwidth between the on-chip system and the off-chip system.
- a bandwidth utilization coefficient β(L) may be added to the cost function expressed in the formulas (1)-(4) above, which is equal to a ratio between an equivalent bandwidth of loading the matrix blocks segment by segment at a data length “L” (such as every L elements in a matrix block) and a full bandwidth.
- the equivalent bandwidth is the inverse of the time taken to load one matrix block segment by segment according to a certain data length
- the full bandwidth refers to the total bandwidth of data transmission between the on-chip system and the off-chip system, which is approximately equal to the inverse of the time taken to load the matrix blocks from the off-chip system to the on-chip system continuously at one time.
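- Restating the two definitions above in symbols, with β denoting the bandwidth utilization coefficient:

$$\beta(L) = \frac{\mathrm{BW}_{\mathrm{eq}}(L)}{\mathrm{BW}_{\mathrm{full}}}, \qquad 0 < \beta(L) \le 1,$$

where BW eq (L) is the data volume of one matrix block divided by the time taken to load it segment by segment, L elements at a time, and BW full is approximately that volume divided by the time taken to load the block continuously at one time.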
- the cost function in the formula (1) may be further expressed as follows:
- lda represents a leading dimension of the matrix A.
- ldb represents a leading dimension of the matrix B (“leading dimension” is abbreviated as “ld”), where the leading dimension refers to a row or column width of a matrix when the matrix is stored on an off-chip system in either row-major or column-major order storage format.
- lda b represents a leading dimension of a matrix block obtained after splitting the matrix A; and ldb b represents a leading dimension of a matrix block obtained after splitting the matrix B.
- the matrix when the matrix is stored row by row in the off-chip system, the matrix is in row-major order, and the leading dimension of the matrix is the row width of the matrix (which is the number of column elements).
- the matrix when the matrix is stored column by column in the off-chip system, the matrix is in column-major order, and the leading dimension of the matrix is the column width of the matrix (which is the number of row elements).
- lda b represents a splitting granularity of the matrix A in the leading dimension
- ldb b represents a splitting granularity of the matrix B in the leading dimension
- the splitting granularity refers to the number of elements when the matrix block is split in the leading dimension
- β(ldb b ) represents a ratio between an equivalent bandwidth of ldb b and a full bandwidth, where the equivalent bandwidth of ldb b is the inverse of the time taken to load one matrix block segment by segment at a data length (such as N b ) of ldb b .
- β(lda b ) represents a ratio between an equivalent bandwidth of lda b and a full bandwidth, where the equivalent bandwidth of lda b is the inverse of the time taken to load one matrix block segment by segment at a data length (such as K b ) of lda b .
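- Formula (5) is likewise an image in the original publication. Combining the cost form sketched after formula (4) above with the bandwidth utilization coefficients just defined gives one plausible shape (an assumption, not the published formula):

$$\mathrm{cost} \approx \frac{N}{N_b}\cdot\frac{\mathrm{size}(A)}{\beta(lda_b)} + \frac{M}{M_b}\cdot\frac{\mathrm{size}(B)}{\beta(ldb_b)}$$

- Dividing each transferred volume by its bandwidth utilization coefficient converts it into an effective transfer cost, so splitting granularities with poor DDR efficiency are penalized.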
- the method 200 may also include creating a search space used for minimizing the cost function, so that the splitting coefficients are determined by using the search space.
- creating the search space used for minimizing the cost function may include dividing a high-speed buffer (or a cache memory) of the on-chip system and creating the search space according to a division result.
- the high-speed buffer is configured to store split matrix blocks and matrix multiplication results obtained by performing a matrix multiplication on the split matrix blocks.
- FIG. 3 is an architecture diagram for performing matrix access according to an embodiment of the present disclosure.
- the architecture of the present disclosure includes an on-chip system and an off-chip system.
- a global memory DRAM 302 is shown by example in the off-chip system.
- the DRAM may transfer data with a L2 cache 304 through a DDR interface.
- a matrix block that is to participate in a matrix multiplication is divided into matrix sub-blocks and then loaded into the L2 cache 304 .
- the L2 cache 304 may be a shared memory of the on-chip system, which is shared by a plurality of master computing units.
- the L2 cache 304 may transfer data with a plurality of L1 caches 306 , so that atomic matrices obtained by splitting the matrix block again are transferred to the L1 caches 306 accordingly.
- an atomic matrix may be viewed as a minimum matrix unit that performs a matrix multiplication supported by a computing sub-unit.
- a computing core 308 (which is the above computing sub-unit) may acquire the atomic matrices from the L1 caches 306 to perform a matrix multiplication between the atomic matrices.
- the L1 caches 306 may be viewed as private storage areas for each computing core 308 .
- a plurality of computing sub-units may form a master computing unit. For example, the four computing cores 308 in FIG. 3 may form one master computing unit of the present disclosure.
- the on-chip system disclosed herein may include multiple levels of caches. Therefore, the L2 cache 304 shown in FIG. 3 may be viewed as a first level of cache, and the L1 cache 306 may be viewed as a second level of cache.
- the method of the present disclosure may include creating a search sub-space associated with each level of cache according to a predetermined matrix multiplication algorithm that is used to perform a matrix multiplication.
- a corresponding first search sub-space and a second search sub-space may be created according to the first level of cache (such as the L2 cache 304 ) and the second level of cache (such as the L1 cache 306 ).
- the method 200 of the present disclosure may further include: creating the first search sub-space according to settings of a plurality of first high-speed buffers in the first level of cache, where the plurality of first high-speed buffers are configured to store matrix sub-blocks obtained by splitting a matrix block and intermediate operation results obtained by performing a matrix multiplication on the matrix sub-blocks; and creating the second search sub-space according to settings of a plurality of second high-speed buffers in the second level of cache, where the plurality of second high-speed buffers are configured to store atomic matrices obtained by splitting a matrix sub-block and intermediate operation results obtained by performing a matrix multiplication on the atomic matrices.
- FIG. 4 and FIG. 5 are used as examples to discuss how to create the search space of the present disclosure based on two levels of caches.
- the two levels of caches are an L2 cache and an L1 cache (the L2 cache and the L1 cache shown in FIG. 3 ), respectively, and it is assumed that the Cannon algorithm is used to accelerate the matrix multiplication.
- the buffer1 may be configured to receive data sent by other master computing units
- the buffer2 may load matrix data from a global memory (such as the DRAM shown in FIG. 3 )
- the buffer3 is provided to a master computing unit for transferring data to the L1 cache to enable a computing sub-unit to perform real-time computing and save intermediate results in the L1 cache.
- when A block , B block , and C block are respectively split into P 1 pieces in the M, K, and N dimensions (the row and column directions of the two matrices) according to a first level of the Cannon algorithm, thus forming P 1 matrix sub-blocks, a restriction on the L2 cache may be expressed by a formula (6):
- the above first search sub-space is determined after considering the Cannon algorithm that is used for the matrix multiplication. More generally, for any matrix multiplication algorithm, the present disclosure proposes that S 10 , S 11 , and S 13 separate buffers may be set up on the L2 cache respectively for A block of a first matrix, B block of a second matrix, and C block of a result matrix, and the matrix blocks may be split into P 10 , P 11 , and P 12 pieces respectively in the M, K, and N dimensions according to the matrix multiplication algorithm. Then, a restriction on the L2 cache (which is the first search sub-space of the present disclosure) may be expressed by a formula (7):
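- Formula (7) itself is not reproduced in this text. Given the buffer counts S 10 , S 11 , and S 13 and the splitting factors P 10 , P 11 , and P 12 just introduced, its likely shape is a capacity constraint of the following form (an assumed reconstruction, with C L2 denoting the L2 cache capacity in elements):

$$S_{10}\,\frac{M_b}{P_{10}}\cdot\frac{K_b}{P_{11}} + S_{11}\,\frac{K_b}{P_{11}}\cdot\frac{N_b}{P_{12}} + S_{13}\,\frac{M_b}{P_{10}}\cdot\frac{N_b}{P_{12}} \le C_{L2}$$

- The three terms are the footprints of the buffered sub-blocks of the first matrix, the second matrix, and the result matrix, respectively.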
- the present disclosure sets a plurality of buffers on the L1 cache according to the Cannon algorithm to determine the second search sub-space of the present disclosure. Therefore, the present disclosure proposes that two separate buffers, which are the buffer1 and buffer2 shown in FIG. 5 , may be set up on the L1 cache for the matrix A and the matrix B, respectively, to be used for pipeline operations of the matrix multiplication of atomic matrices (which are obtained by splitting the matrix sub-blocks again). In the case of pipeline operations, the buffer1 and buffer2 may alternate between receiving the atomic matrices and participating in the operation. Next, one separate buffer is set up for a matrix C that resides on the L1 cache, which is the buffer3 shown in FIG. 5 .
- each of the previously split matrix sub-blocks is further split into P 0 pieces respectively in the M, K, and N dimensions according to a second level of the Cannon algorithm to obtain the atomic matrices of the present disclosure.
- a restriction on the L1 cache may be expressed by a formula (8):
- the above second search sub-space is determined after considering the Cannon algorithm that is used for the matrix multiplication. More generally, for any matrix multiplication algorithm, the present disclosure proposes that S 00 , S 01 , and S 03 separate buffers may be set up on the L1 cache respectively for A block of a first matrix, B block of a second matrix, and C block of a result matrix, and on the basis of the arrangement of the L2 cache, the above matrix sub-blocks may be split again into P 00 , P 01 , and P 02 pieces respectively in the M, K, and N dimensions according to the algorithm. Then, a restriction on the L1 cache (which is the second search sub-space of the present disclosure) may be expressed by a formula (9):
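- Formula (9) is also an image in the original publication. By analogy with the L2 case, and with the second-level splitting factors compounding the first-level ones, its likely shape is (an assumed reconstruction, with C L1 denoting the L1 cache capacity in elements):

$$S_{00}\,\frac{M_b}{P_{10}P_{00}}\cdot\frac{K_b}{P_{11}P_{01}} + S_{01}\,\frac{K_b}{P_{11}P_{01}}\cdot\frac{N_b}{P_{12}P_{02}} + S_{03}\,\frac{M_b}{P_{10}P_{00}}\cdot\frac{N_b}{P_{12}P_{02}} \le C_{L1}$$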
- when suitable parameter values are substituted, the second search sub-space expressed by the formula (9) is converted into the second search sub-space expressed by the formula (8). Therefore, the formula (8) may be viewed as a special case of the formula (9).
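- The alternating use of buffer1 and buffer2 described above is a double-buffering (ping-pong) pipeline. The following schematic sketch illustrates the idea; `load_next` and `compute` are hypothetical placeholders rather than APIs from the patent, and on real hardware the load would be issued asynchronously so that it overlaps with the computation:

```python
def pipelined_block_multiply(load_next, compute, num_steps):
    """Ping-pong pipeline over two L1 buffers (schematic sketch only)."""
    buffers = [None, None]        # buffer1 and buffer2 on the L1 cache
    buffers[0] = load_next(0)     # prefetch the first atomic matrices
    for step in range(num_steps):
        cur, nxt = step % 2, (step + 1) % 2
        if step + 1 < num_steps:
            # fill the idle buffer with the next atomic matrices while
            # the current buffer participates in the operation
            buffers[nxt] = load_next(step + 1)
        compute(buffers[cur])     # multiply out of the current buffer
```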
- determining the splitting coefficients by minimizing the cost function may include determining search strides used to search the search space, where the search strides include Δm, Δk, and Δn respectively associated with the M, K, and N dimensions. Further, a search algorithm may be used to search in the search space with the search strides to determine M b , K b , and N b used for minimizing the cost function.
- the above search strides of the present disclosure may be determined by considering the following two factors:
- determining a splitting size (“tiling size”) of a matrix to be multiplied according to the computing power of each of the plurality of computing sub-units (such as the “core” in FIG. 3 or the computing sub-unit in FIG. 8 ) of the on-chip system.
- the above computing power involves a matrix size that the computing sub-unit may support in the matrix multiplication, which is a size of the atomic matrix of the present disclosure.
- the splitting size refers to a size of the first or second matrix of the present disclosure that satisfies matrix multiplication requirements of the computing sub-unit after a second level of splitting.
- the atomic matrices obtained after the second level of splitting may be stored in the above L1 cache.
- splitting sizes supported by the computing sub-units may be different.
- the splitting sizes may be (8×8), (16×16), or (16*V)×(16*V), where “*” represents a multiplication sign, and V is a positive integer greater than 1.
- when the splitting size supported by the computing sub-unit is (16×16), a matrix with a size of ((16*V)×(16*Q)) may also be stored on the L1 cache, where Q is a positive integer greater than 1.
- a matrix with a size of (16×16) may be read at a time from the L1 cache described above to perform the matrix multiplication.
- e K and e N may take values such as 4, 8, 16, 32, or 64, depending on the computing power of the computing sub-unit.
- the search strides Δm, Δk, and Δn in the row and column dimensions of M, K, and N satisfy the following equation conditions, where P 0 and P 1 have the same meaning as in the formula (8).
- computable matrix sizes on the computing sub-unit are integer multiples of e M , e K , and e N in the M, K and N dimensions, respectively.
- the search strides Δm, Δk, and Δn in the row and column dimensions of M, K, and N satisfy the following equation conditions, where P 12 , P 11 , P 10 , P 00 , P 01 , and P 02 have the same meaning as in the formula (9).
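- The equation conditions referenced in the two preceding paragraphs appear as images in the original publication. Since computable matrix sizes on a sub-unit are integer multiples of e M , e K , and e N , and since the first- and second-level splitting factors must divide each candidate evenly, their likely forms are (an assumed reconstruction):

$$\Delta m = P_1 P_0\,e_M, \qquad \Delta k = P_1 P_0\,e_K, \qquad \Delta n = P_1 P_0\,e_N$$

and, in the general case of the formula (9),

$$\Delta m = P_{10} P_{00}\,e_M, \qquad \Delta k = P_{11} P_{01}\,e_K, \qquad \Delta n = P_{12} P_{02}\,e_N.$$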
- leading dimensions lda and ldb of the matrix are determined according to storage formats (such as column-major or row-major) and transposition methods of the matrix A and the matrix B, and at the same time, search strides of the matrix A and the matrix B in the leading dimensions are determined according to performance of the DRAM or DDR (such as the off-chip system or the global memory described in the present disclosure) for uploading and downloading data.
- Δk = scm(P 1 ×P 0 ×e K , L);
- Δn = scm(P 1 ×P 0 ×e N , L).
- Δk = scm(P 1 ×P 0 ×e K , L);
- Δn = P 1 ×P 0 ×e N .
- Δm = scm(P 1 ×P 0 , L);
- Δk = scm(P 1 ×P 0 ×e K , L);
- Δn = P 1 ×P 0 ×e N .
- Δm = scm(P 1 ×P 0 , L);
- Δk = P 1 ×P 0 ×e K ;
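- Reading “scm” as the smallest common multiple, the strides of the first group of equations above can be computed as follows; the concrete parameter values are illustrative assumptions:

```python
from math import lcm  # "scm" is read here as the smallest common multiple

P1, P0 = 2, 2              # master computing units / sub-units per grid side
eK, eN = 16, 16            # atomic-matrix granularity of one sub-unit
L = 128                    # data length with the best DDR bandwidth

dk = lcm(P1 * P0 * eK, L)  # Δk = scm(P1 × P0 × eK, L) -> 128
dn = lcm(P1 * P0 * eN, L)  # Δn = scm(P1 × P0 × eN, L) -> 128
```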
- matrix information of the present disclosure includes the number of master computing units participating in the matrix multiplication (which may be used, for example, to determine a size of the “P 1 ” value above), the number of computing sub-units in each of the master computing units (which may be used, for example, to determine a size of the “P 0 ” value above), and a data size (such as the aforementioned “L” used to determine the search strides) at which loading from the off-chip system (such as the “DDR”) achieves the highest bandwidth utilization.
- the method of the present disclosure may include determining the search strides at least based on the number of the master computing units, the number of the computing sub-units, and the data size.
- determining the search strides also requires considering the storage format of the matrix and whether the matrix is transposed, so the matrix information of the present disclosure may also include storage formats of a first matrix and a second matrix in an off-chip system and transposition information about whether the matrix is transposed, where the storage formats include storage in row-major order or column-major order as described above.
- the method of the present disclosure also includes: determining the search strides according to the storage formats and transposition information of the first matrix and the second matrix, which are the above search strides Δm, Δk, and Δn respectively associated with the M, K, and N dimensions.
- the method of the present disclosure may search for optimal splitting coefficients M b , K b , and N b by adopting suitable search algorithms.
- the search algorithms may include, but are not limited to, a global search, a neighborhood search, a genetic algorithm, and other optimization algorithms.
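- The exemplary pseudo-code itself appears as an image in the original publication. The sketch below is a hypothetical Python rendering of a global search of this kind; the names and signatures are illustrative, with `in_U1` and `in_U2` standing for membership tests of the collections U1 and U2 discussed next:

```python
from itertools import product

def find_split(M, K, N, dm, dk, dn, cost, in_U1, in_U2):
    """Grid search for the splitting coefficients minimizing `cost`.

    Walks the search space with strides (dm, dk, dn), keeps only the
    candidates satisfying both cache restrictions (U1 for the L1 cache,
    U2 for the L2 cache), and returns the argmin of the cost function.
    """
    best, best_cost = None, float("inf")
    for Mb, Kb, Nb in product(range(dm, M + 1, dm),
                              range(dk, K + 1, dk),
                              range(dn, N + 1, dn)):
        if not (in_U1(Mb, Kb, Nb) and in_U2(Mb, Kb, Nb)):
            continue  # violates an on-chip cache capacity restriction
        c = cost(Mb, Kb, Nb)
        if c < best_cost:
            best, best_cost = (Mb, Kb, Nb), c
    return best, best_cost
```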
- U1 in the above exemplary pseudo-code is a collection (which is the second search sub-space in the context of the present disclosure, as shown in the formula (8)) that satisfies the restrictions of the L1 cache, and U2 is a collection (which is the first search sub-space in the context of the present disclosure, as shown in the formula (6)) that satisfies the restrictions of the L2 cache.
- the method for optimizing the matrix multiplication of the on-chip system of the present disclosure is described above in combination with FIGS. 1 - 5 .
- optimal splitting coefficients for splitting a matrix may be determined.
- the cost in terms of data transmission (such as I/O overhead) is minimal.
- a hardware platform that performs a matrix multiplication will perform the matrix multiplication in a more efficient and less computationally expensive way.
- FIG. 6 a and FIG. 6 b are schematic diagrams of matrix block splitting principle according to embodiments of the present disclosure.
- a size of one of matrix blocks A ik block (as shown in the figure) obtained after splitting a first matrix of the present disclosure may be determined as (M b *K b ) by using splitting coefficients determined by the optimization algorithm of the present disclosure.
- a size of one of matrix blocks B kj block (as shown in the figure) obtained after splitting a second matrix is (K b *N b )
- a size of one of matrix blocks C ij block (as shown in the figure) obtained after splitting a result matrix is (M b *N b ).
- C ij block is used as an example, which equals “A ik,11 block ×B kj,11 block +A ik,12 block ×B kj,21 block ”, where A ik,11 block and A ik,12 block are two matrix sub-blocks obtained after splitting the matrix block A ik block described above, and B kj,11 block and B kj,21 block are two matrix sub-blocks obtained after splitting the matrix block B kj block .
- each matrix sub-block A ik,ef block (such as “A ik,11 block ”) of the matrix block A ik block may be split into four atomic matrices, which are A ik,ef,11 block , A ik,ef,12 block , A ik,ef,21 block , and A ik,ef,22 block in the figure.
- the same splitting applies to B kj block and C ij block , which is not repeated here.
- FIG. 7 is a structural block diagram of an on-chip system according to an embodiment of the present disclosure.
- this on-chip system includes a plurality of master computing units, such as master computing units 1 - 4 shown in the figure.
- the L2 cache, which is further shown, is shared by the plurality of master computing units described above.
- this L2 cache is configured with a high-speed buffer for loading matrix data from an off-chip system (such as the DDR shown in FIG. 7 ), a high-speed buffer for transferring data between adjacent master computing units, and a high-speed buffer for matrix multiplication.
- the figure takes the Cannon algorithm for matrix multiplication as an example to show the matrix sub-blocks loaded at each master computing unit.
- a master computing unit 1 loads A ik,11 block and B kj,11 block from the DDR via the L2 cache
- a master computing unit 2 loads A ik,12 block and B kj,22 block from the DDR via the L2 cache, and so on, to perform a first matrix multiplication operation, such as A ik,11 block ×B kj,11 block .
- the master computing unit may also receive matrix sub-blocks from adjacent master computing units to further perform a matrix multiplication, thus obtaining corresponding matrix sub-blocks as intermediate results (such as C ij,11 block described above).
- the master computing unit 1 may receive A ik,12 block from the master computing unit 2 and receive B kj,21 block from the master computing unit 4 to perform a second matrix multiplication operation, such as A ik,12 block ×B kj,21 block , according to the Cannon algorithm.
- the master computing unit 3 may receive B kj,22 block from the master computing unit 2 and receive A ik,22 block from the master computing unit 4 to perform the second matrix multiplication operation according to the Cannon algorithm.
- each master computing unit may obtain a corresponding result matrix sub-block.
- the master computing unit 1 obtains a result matrix sub-block C ij,11 block by accumulating the results of its two matrix multiplication operations; the master computing unit 2 similarly obtains a result matrix sub-block C ij,12 block ; and so on.
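- For reference, the two-round exchange just described follows the textbook Cannon algorithm. The NumPy sketch below simulates it on a P×P grid of units for square matrices whose edge is divisible by P (a generic illustration, not the patent's implementation):

```python
import numpy as np

def cannon_matmul(A, B, P):
    """Cannon's algorithm on a simulated P x P grid of computing units."""
    n = A.shape[0]
    b = n // P  # edge length of one sub-block
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(P)] for i in range(P)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(P)] for i in range(P)]
    Cb = [[np.zeros((b, b)) for _ in range(P)] for _ in range(P)]
    # initial skew: shift row i of A left by i, column j of B up by j
    Ab = [[Ab[i][(j + i) % P] for j in range(P)] for i in range(P)]
    Bb = [[Bb[(i + j) % P][j] for j in range(P)] for i in range(P)]
    for _ in range(P):  # P rounds of local multiply plus rotation
        for i in range(P):
            for j in range(P):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]  # local block product
        Ab = [[Ab[i][(j + 1) % P] for j in range(P)] for i in range(P)]  # rotate left
        Bb = [[Bb[(i + 1) % P][j] for j in range(P)] for i in range(P)]  # rotate up
    return np.block(Cb)

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(cannon_matmul(A, B, P=2), A @ B)
```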
- FIG. 8 is a schematic diagram where a computing sub-unit performs a matrix multiplication according to an embodiment of the present disclosure.
- each master computing unit in FIG. 7 may include a plurality of computing sub-units that perform a matrix multiplication in parallel, such as the four computing sub-units shown in FIG. 8 , including computing sub-units 1 - 4 .
- each computing sub-unit may acquire atomic matrices required for matrix multiplication from a L1 cache, where the atomic matrices are minimum matrix units for matrix multiplication supported by the computing sub-unit.
- each master computing unit is required to complete matrix multiplication twice when acquiring a corresponding result matrix sub-block, where each matrix multiplication operation includes a first round of computing and a second round of computing performed by four computing sub-units shown in FIG. 8 .
- the computing sub-unit 1 acquires A ik,ef,11 block and B kj,fg,11 block from the L1 cache to perform a matrix multiplication.
- the computing sub-unit 1 then acquires A ik,ef,12 block and B kj,fg,21 block from the L1 cache to perform the matrix multiplication.
- the algorithm selection solution is to select, from a plurality of matrix multiplication algorithms suitable for a matrix multiplication, an optimal algorithm with which to perform the matrix multiplication.
- different search spaces may be set up by using different splitting methods on the on-chip system, thus finally obtaining different matrix multiplication algorithms. For example, when a splitting operation is performed at the computing sub-unit level, matrix sub-blocks of a first matrix may be split only (for example, in the M row direction), and matrix sub-blocks of a second matrix may not be split, thus obtaining a corresponding search space and finally forming a new matrix multiplication algorithm.
- the matrix sub-blocks of the second matrix may be split (for example, in the N column direction), and the matrix sub-blocks of the first matrix may not be split, thus forming another new matrix multiplication algorithm. It is contemplated that the two matrix multiplication algorithms obtained above do not perform splitting at the computing sub-unit level in the K direction (column direction for the first matrix and row direction for the second matrix).
- a global optimization goal may be set as the following in the algorithm space:
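- The goal itself is an image in the original publication; in symbols it can plausibly be written as (an assumed form):

$$\min_{1 \le i \le n}\ \min_{(M_{b_i},\,K_{b_i},\,N_{b_i}) \in U_i} \mathrm{cost}_i\left(M_{b_i}, K_{b_i}, N_{b_i}\right)$$

where n is the number of candidate matrix multiplication algorithms, U i is the search space of the i-th algorithm, and cost i is its cost function.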
- FIG. 9 is a flowchart of a method 900 for selecting an optimal matrix multiplication algorithm according to an embodiment of the present disclosure.
- a cost function is determined, where the determination method for the cost function may be the method described above in combination with FIG. 2 , which is not repeated here.
- the following cost function may be determined:
- in step S 904 , a search space of each matrix multiplication algorithm in a plurality of matrix multiplication algorithms (which are the above plurality of “candidate algorithms”) is determined, and in step S 906 , search strides of the search space are determined.
- the determination methods for the search space and the search strides may refer to the aforementioned description and will not be repeated here.
- in step S 908 , a search is performed by using a search algorithm (such as the above global search, neighborhood search, or genetic algorithm) at the determined search strides, thus, in step S 910 , determining splitting coefficients corresponding to each matrix multiplication algorithm (such as splitting coefficients M bi , K bi , and N bi for an i-th algorithm).
- in step S 912 , a cost function value of each matrix multiplication algorithm is computed, and in step S 914 , a matrix multiplication algorithm with a minimum cost function value is determined. Therefore, in step S 916 , the matrix multiplication algorithm with the minimum cost function value is selected as an optimal matrix multiplication algorithm, and the corresponding splitting coefficients of the matrix multiplication algorithm are used to implement multiple levels of splitting on large matrices.
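- Putting the steps of method 900 together, the following compact sketch of the selection loop reuses the hypothetical `find_split` search sketched earlier; the attribute names on `algo` are illustrative assumptions:

```python
def select_algorithm(algorithms, M, K, N):
    """Sketch of method 900: search per algorithm, then pick the argmin."""
    best = None
    for algo in algorithms:
        # steps S904-S910: search this algorithm's space at its strides
        coeffs, c = find_split(M, K, N, algo.dm, algo.dk, algo.dn,
                               algo.cost, algo.in_U1, algo.in_U2)
        # steps S912-S914: keep the algorithm with the minimum cost value
        if best is None or c < best[2]:
            best = (algo, coeffs, c)
    return best  # step S916: optimal algorithm and its splitting coefficients
```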
- an optimal algorithm may be selected from a plurality of algorithms for matrix multiplication.
- the selected algorithm may implement multiplication of large matrices with minimum operation cost, thus improving operation efficiency of matrix multiplication and reducing computing cost.
- resource usage of the on-chip system is maximized, thus taking full advantage of computing power of the on-chip system.
- FIG. 10 is a structural diagram of a combined processing apparatus 1000 according to an embodiment of the present disclosure.
- the combined processing apparatus includes a computing processing apparatus 1002 , an interface apparatus 1004 , other processing apparatus 1006 , and a storage apparatus 1008 .
- the computing processing apparatus may include one or a plurality of integrated circuit apparatuses 1010 .
- the integrated circuit apparatus may include the on-chip system described in the context of the present disclosure, and the on-chip system is configured to perform a matrix multiplication between matrices.
- the matrices may be large matrices or super-large matrices.
- the above large matrices or super-large matrices may be split based on splitting coefficients, thus obtaining matrix blocks suitable for matrix multiplication performed by the on-chip system.
- the computing processing apparatus of the present disclosure may be configured to perform an operation specified by a user, such as matrix multiplication of the present disclosure.
- the computing processing apparatus may be implemented as (or may include) a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
- one or a plurality of computing apparatuses included in the computing processing apparatus may be implemented as an artificial intelligence processor core or a partial hardware structure of the artificial intelligence processor core. If the plurality of computing apparatuses are implemented as artificial intelligence processor cores or partial hardware structures of the artificial intelligence processor cores, the computing processing apparatus of the present disclosure may be regarded as having a single-core structure or an isomorphic multi-core structure.
- the computing processing apparatus of the present disclosure may interact with other processing apparatus through the interface apparatus, so as to jointly complete the operation specified by the user.
- other processing apparatuses of the present disclosure may include one or more types of general and/or dedicated processors, including a central processing unit (CPU), a graphics processing unit (GPU), an artificial intelligence processor, and the like.
- these processors include but are not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), other programmable logic components, discrete gate or transistor logic components, discrete hardware components, and the like.
- the number of the processors may be determined according to actual requirements.
- the computing processing apparatus of the present disclosure may be regarded as having the single-core structure or the isomorphic multi-core structure. However, when the computing processing apparatus and other processing apparatus are considered together, both the computing processing apparatus and other processing apparatus may be regarded as forming a heterogeneous multi-core structure.
- other processing apparatus may serve as an interface between the computing processing apparatus (which may be embodied as an artificial intelligence operation apparatus such as a neural network operation apparatus) of the present disclosure and external data and controls.
- Other processing apparatus may perform basic controls that include but are not limited to moving data, and starting and/or stopping the computing apparatus.
- other processing apparatus may also cooperate with the computing processing apparatus to jointly complete an operation task.
- the interface apparatus may be used to transfer data and a control instruction between the computing processing apparatus and other processing apparatus.
- the computing processing apparatus may acquire input data from other processing apparatus via the interface apparatus and write the input data to an on-chip storage apparatus (or called a memory) of the computing processing apparatus.
- the computing processing apparatus may acquire the control instruction from other processing apparatus via the interface apparatus and write the control instruction to an on-chip control cache of the computing processing apparatus.
- the interface apparatus may further read data in the storage apparatus of the computing processing apparatus and then transfer the data to other processing apparatus.
- the combined processing apparatus of the present disclosure may further include a storage apparatus. As shown in the figure, the storage apparatus is connected to the computing processing apparatus and other processing apparatus, respectively.
- the storage apparatus may be used to save data of the computing processing apparatus and/or other processing apparatus.
- the data may be data that may not be fully saved in the internal or the on-chip storage apparatus of the computing processing apparatus or other processing apparatus.
- the present disclosure also discloses a chip (such as a chip 1102 shown in FIG. 11 ).
- the chip is a system-on-chip (SoC) and integrates one or a plurality of combined processing apparatuses shown in FIG. 10 and may be configured to perform a matrix multiplication between matrix blocks.
- the chip may be connected to other related components through an external interface apparatus (such as an external interface apparatus 1106 shown in FIG. 11 ).
- the related components may be, for example, a camera, a monitor, a mouse, a keyboard, a network card, or a WIFI interface.
- the chip may integrate other processing units (such as a video codec) and/or an interface unit (such as a dynamic random-access memory (DRAM) interface), and the like.
- the present disclosure also discloses a chip package structure, including the chip.
- the present disclosure discloses a board card, including the chip package structure. The board card will be described in detail in combination with FIG. 11 below.
- FIG. 11 is a schematic structural diagram of a board card 1100 according to an embodiment of the present disclosure.
- the board card includes a storage component 1104 used for storing data.
- the storage component 1104 includes one or a plurality of storage units 1110 .
- the storage component may be connected to and may transfer data to a control component 1108 and the chip 1102 through a bus.
- the board card further includes an external interface apparatus 1106 , which is configured to implement data relay or transfer between the chip (or the chip in the chip package structure) and an external device 1112 (such as a server or a computer, and the like). For example, to-be-processed data may be transferred from the external device to the chip through the external interface apparatus.
- a computing result of the chip may still be sent back to the external device through the external interface apparatus.
- the external interface apparatus may have different interface forms.
- the external interface apparatus may adopt a standard peripheral component interface express (PCIe) interface.
- the control component in the board card of the present disclosure may be configured to regulate and control the state of the chip.
- the control component may include a micro controller unit (MCU), which may be used to regulate and control the working state of the chip.
- the present disclosure also discloses an electronic device or apparatus.
- the electronic device or apparatus may include one or a plurality of board cards, one or a plurality of the chips, and/or one or a plurality of the combined processing apparatuses.
- the electronic device or apparatus may be configured to perform a matrix multiplication discussed in the context of the present disclosure and matrix data participating in the matrix multiplication is a matrix block obtained after splitting based on optimal splitting coefficients disclosed herein.
- an electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet of Things terminal, a mobile terminal, a mobile phone, a traffic recorder, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, a headphone, a mobile storage, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device.
- the vehicle includes an airplane, a ship, and/or a car;
- the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas cooker, and a range hood;
- the medical device includes a nuclear magnetic resonance spectrometer, a B-ultrasonic scanner, and/or an electrocardiograph.
- the electronic device or apparatus of the present disclosure may be further applied to Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical, and other fields.
- the electronic device or apparatus of the present disclosure may be further used in application scenarios including cloud, edge, and terminal related to artificial intelligence, big data, and/or cloud computing.
- an electronic device or apparatus with high computing power may be applied to a cloud device (such as the cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (such as a smart phone or the webcam).
- hardware information of the cloud device is compatible with that of the terminal device and/or the edge device.
- appropriate hardware resources may be matched from hardware resources of the cloud device to simulate hardware resources of the terminal device and/or the edge device to complete unified management, scheduling, and collaborative work of terminal-cloud integration or cloud-edge-terminal integration.
- the present disclosure divides the units on the basis of considering logical functions, but there may be other division methods during actual implementations.
- a plurality of units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled.
- the connection discussed above in combination with drawings may be direct or indirect coupling between the units or components.
- the direct or indirect coupling involves a communication connection using an interface.
- the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
- units described as separate components may be or may not be physically separated.
- Components shown as units may be or may not be physical units.
- the components or units may be located in a same position or distributed to a plurality of network units. Additionally, according to actual requirements, some or all of the units may be selected for achieving the purpose of the solution described in the embodiments of the present disclosure. Additionally, in some scenarios, the plurality of units in the embodiments of the present disclosure may be integrated into one unit, or each of the units may be physically separated.
- the integrated unit may be implemented in the form of a software program unit. If the integrated unit is implemented as a software program unit and sold or used as an independent product, it may be stored in a computer-readable memory. On this basis, when the solution of the present disclosure is embodied as a software product (such as a computer-readable storage medium), the software product may be stored in a memory.
- the software product may include several instructions used to enable a computer device (which may be a personal computer, a server, a network device, or the like) to perform some or all of the steps of the method of the embodiments of the present disclosure; a purely illustrative sketch of such instructions follows the memory enumeration below.
- the memory includes but is not limited to a USB flash drive, a flash disk, a read-only memory (ROM), a random-access memory (RAM), a mobile hard disk, a magnetic disk, an optical disc, and other media that may store program code.
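As a loose, hypothetical illustration only, and not the claimed method, the sketch below shows one way such instructions could be organized: a blocked matrix multiplication that splits the operand matrices into sub-matrices small enough to fit in a limited on-chip buffer. The function name, the `block` parameter, and the use of NumPy are assumptions made purely for this example.

```python
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, block: int = 64) -> np.ndarray:
    """Multiply an (M x K) matrix by a (K x N) matrix tile by tile.

    Both operands are split into block x block sub-matrices so that each
    partial product touches only a small tile of data at a time; `block`
    is a hypothetical tuning parameter that a search procedure could
    choose to suit a particular chip's buffer size.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, block):          # rows of the output tile
        for j in range(0, n, block):      # columns of the output tile
            for p in range(0, k, block):  # the reduction (K) dimension
                # NumPy slicing clamps at the array bounds, so ragged
                # edge tiles are handled without extra bookkeeping.
                c[i:i+block, j:j+block] += (
                    a[i:i+block, p:p+block] @ b[p:p+block, j:j+block]
                )
    return c
```

For instance, `blocked_matmul(np.ones((128, 96)), np.ones((96, 64)), block=32)` produces the same 128 x 64 result as a direct `a @ b`, computed one sub-matrix product at a time.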
- the integrated unit may be implemented in the form of hardware.
- the hardware may be a specific hardware circuit, which may include a digital circuit and/or an analog circuit, and the like.
- a physical implementation of a hardware structure of the circuit includes but is not limited to a physical component.
- the physical component includes but is not limited to a transistor, or a memristor, and the like.
- various apparatuses such as the computing apparatus or other processing apparatus described in the present disclosure may be implemented by an appropriate hardware processor, such as a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), a digital signal processor (DSP), and an application-specific integrated circuit (ASIC), and the like.
- the storage unit or the storage apparatus may be any appropriate storage medium (including a magnetic storage medium or a magneto-optical storage medium), such as a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and the like.
- any module, unit, component, server, computer, terminal, or device performing an instruction of the embodiments of the present disclosure may include or otherwise have access to a computer-readable medium, such as a storage medium, a computer storage medium, or a data storage device (removable and/or non-removable) such as a disk, a compact disc, or a magnetic tape.
- the computer storage medium may include volatile and non-volatile, movable and immovable media implemented by any method or technology used to store information, such as a computer-readable instruction, a data structure, a program module, or other data.
- a clause “if it is determined that” or “if [a described condition or event] is detected” may be interpreted as “once it is determined that”, or “in response to a determination”, or “once [a described condition or event] is detected”, or “in response to a case where [a described condition or event] is detected”.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Complex Calculations (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110414133.3 | 2021-04-16 | ||
CN202110414133.3A CN115221101B (zh) | 2021-04-16 | 2021-04-16 | Method for optimizing matrix multiplication operation on system on chip, and related product |
PCT/CN2022/086815 WO2022218374A1 (zh) | 2021-04-16 | 2022-04-14 | Method for optimizing matrix multiplication operation on system on chip, and related product |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/086815 Continuation WO2022218374A1 (zh) | 2021-04-16 | 2022-04-14 | Method for optimizing matrix multiplication operation on system on chip, and related product |
Publications (1)
Publication Number | Publication Date |
---|---|
US20240028666A1 true US20240028666A1 (en) | 2024-01-25 |
Family
ID=83605184
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/374,817 Pending US20240028666A1 (en) | 2021-04-16 | 2023-09-29 | Method for optimizing matrix multiplication operation on system on chip, and related product |
Country Status (4)
Country | Link |
---|---|
US (1) | US20240028666A1 (zh) |
EP (1) | EP4325373A1 (zh) |
CN (1) | CN115221101B (zh) |
WO (1) | WO2022218374A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118733206A (zh) * | 2023-03-30 | 2024-10-01 | Cambricon Technologies Corporation Limited | Task scheduling method and apparatus based on multi-core system, and related products |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0425296A3 (en) * | 1989-10-27 | 1992-10-14 | Texas Instruments Incorporated | Speedup for solution of systems of linear equations |
US7792895B1 (en) * | 2006-06-16 | 2010-09-07 | Nvidia Corporation | Efficient matrix multiplication on a parallel processing device |
CN102375721B (zh) * | 2010-08-23 | 2016-03-30 | Lenovo (Beijing) Co., Ltd. | Matrix multiplication operation method, graphics processor, and electronic device |
US20120316886A1 * | 2011-06-08 | 2012-12-13 | Ramin Pishehvar | Sparse coding using object extraction |
US20170116156A1 (en) * | 2015-10-22 | 2017-04-27 | International Business Machines Corporation | Parallelizing matrix factorization across hardware accelerators |
US10032247B2 (en) * | 2016-06-22 | 2018-07-24 | Palo Alto Research Center Incorporated | System and method for speeding up general matrix-vector multiplication on GPU |
CN108090028B (zh) * | 2017-12-15 | 2021-03-23 | Cambricon Technologies Corporation Limited | Computing method and related products |
KR101990735B1 (ko) * | 2018-03-30 | 2019-06-18 | Seoul National University R&DB Foundation | Method and apparatus for large-scale graph mining using matrix-vector multiplication based on pre-partitioned graphs |
JP7132043B2 (ja) * | 2018-09-10 | 2022-09-06 | Tokyo Keiki Inc. | Reconfigurable processor |
CN111191699B (zh) * | 2019-12-22 | 2022-10-21 | Army Engineering University of PLA | Multi-view clustering method based on non-negative matrix factorization and partition-adaptive fusion |
CN111028136B (zh) * | 2019-12-24 | 2023-04-07 | Shanghai Cambricon Information Technology Co., Ltd | Method and device for processing a two-dimensional complex matrix by an artificial intelligence processor |
CN111125628A (zh) * | 2019-12-24 | 2020-05-08 | Shanghai Cambricon Information Technology Co., Ltd | Method and device for processing a two-dimensional data matrix by an artificial intelligence processor |
CN111523642B (zh) * | 2020-04-10 | 2023-03-28 | SigmaStar Technology Ltd. | Data reuse method, operation method and apparatus, and chip for convolution operations |
CN112541159A (zh) * | 2020-09-30 | 2021-03-23 | Huawei Technologies Co., Ltd. | Model training method and related device |
CN112416433B (zh) * | 2020-11-24 | 2023-01-17 | Cambricon Technologies Corporation Limited | Data processing apparatus, data processing method, and related products |
-
2021
- 2021-04-16 CN CN202110414133.3A patent/CN115221101B/zh active Active
-
2022
- 2022-04-14 EP EP22787596.0A patent/EP4325373A1/en active Pending
- 2022-04-14 WO PCT/CN2022/086815 patent/WO2022218374A1/zh active Application Filing
-
2023
- 2023-09-29 US US18/374,817 patent/US20240028666A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
EP4325373A1 (en) | 2024-02-21 |
CN115221101A (zh) | 2022-10-21 |
CN115221101B (zh) | 2023-12-19 |
WO2022218374A1 (zh) | 2022-10-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | WarpLDA: a cache efficient O(1) algorithm for Latent Dirichlet Allocation | |
WO2020073211A1 (zh) | Operation accelerator, processing method, and related device | |
CN110546611A (zh) | Reducing power consumption in a neural network processor by skipping processing operations | |
CN115221102B (zh) | Method for optimizing convolution operation on system on chip, and related product | |
CN106846235B (zh) | Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instructions | |
US20240028666A1 (en) | Method for optimizing matrix multiplication operation on system on chip, and related product | |
CN112799726B (zh) | Data processing apparatus, method, and related products | |
CN112686379B (zh) | Integrated circuit apparatus, electronic device, board card, and computing method | |
US20200242468A1 (en) | Neural network computation device, neural network computation method and related products | |
CN114580606A (zh) | Data processing method and apparatus, computer device, and storage medium | |
US11775808B2 (en) | Neural network computation device and method | |
CN111125628A (zh) | Method and device for processing a two-dimensional data matrix by an artificial intelligence processor | |
US10915470B2 (en) | Memory system | |
CN111061507A (zh) | Operation method and apparatus, computer device, and storage medium | |
CN111143766A (zh) | Method and device for processing a two-dimensional complex matrix by an artificial intelligence processor | |
US20220283719A1 (en) | Visualizing memory bandwidth utilization using memory bandwidth stack | |
CN113469333B (zh) | Artificial intelligence processor and method for executing a neural network model, and related products | |
CN112817898B (zh) | Data transmission method, processor, chip, and electronic device | |
CN116415100A (zh) | Service processing method and apparatus, processor, and computing device | |
CN112596881B (zh) | Storage component and artificial intelligence processor | |
CN111382852B (zh) | Data processing apparatus and method, chip, and electronic device | |
CN112486402A (zh) | Storage node and system | |
US20230305840A1 (en) | Computing apparatus, integrated circuit chip, board card, device and computing method | |
CN112232498B (zh) | Data processing apparatus, integrated circuit chip, electronic device, board card, and method | |
WO2022143799A1 (zh) | Integrated circuit apparatus, computing device, system, and method for matrix multiplication operation | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: CAMBRICON TECHNOLOGIES CORPORATION LIMITED, CHINA Free format text: NON-DISCLOSURE AGREEMENT;ASSIGNOR:JIANG, GUANG;REEL/FRAME:065084/0945 Effective date: 20181012 Owner name: SHANGHAI CAMBRICON INFORMATION TECHNOLOGY CO., LTD, CHINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SUN, ZHENG;LI, MING;YU, YEHAO;AND OTHERS;REEL/FRAME:065072/0879 Effective date: 20230919 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |