CN116881618A - General matrix multiplication calculation optimization method, device and processor - Google Patents


Info

Publication number
CN116881618A
CN116881618A
Authority
CN
China
Prior art keywords
matrix
calculation
cores
computing
computation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311078065.3A
Other languages
Chinese (zh)
Other versions
CN116881618B (en)
Inventor
孙红江
陈晨
杨贺淞
范文杰
常璟飞
陈�光
曾令仿
李勇
程稳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202311078065.3A priority Critical patent/CN116881618B/en
Publication of CN116881618A publication Critical patent/CN116881618A/en
Application granted granted Critical
Publication of CN116881618B publication Critical patent/CN116881618B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003: Arrangements for executing specific machine instructions
    • G06F 9/30007: Arrangements for executing specific machine instructions to perform operations on data operands
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098: Register arrangements
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to a general matrix multiplication computation optimization method, device, and processor. The method is applied to a processor comprising at least one computation core, each core comprising an arithmetic logic unit, a data cache, and registers, and includes the following steps: determining the size of the general matrix multiplication operator kernel based on the width of the arithmetic logic unit, the number of registers, the capacity of the data cache, and a predetermined number of computation cores used to form the operator kernel; optimizing the number of computation cores used for parallel computation based on the operator-kernel size, the size of a predetermined basic block matrix, and the sizes of the left and right matrices; and optimizing the block computation of the general matrix multiplication computation area in the data cache based on the number of parallel computation cores, the basic-block-matrix size, and the left- and right-matrix sizes. The method solves the problems of low utilization of general matrix multiplication hardware resources and high data-access overhead.

Description

General matrix multiplication calculation optimization method, device and processor
Technical Field
The present application relates to the field of hardware computing technologies, and in particular, to a method and apparatus for optimizing general matrix multiplication computation, and a processor.
Background
General matrix multiplication (GEMM), convolution, and similar operations are the main computational workloads in deep learning. A high-performance processor can approach its designed theoretical peak throughput when computing GEMM, but only if the operator design keeps as many of the underlying physical compute cores busy as possible. Even though the arithmetic logic unit (ALU) width and register counts of high-performance processors keep growing, large-scale GEMM problems, such as those arising in large language models, still require launching an enormous number of ALU iterations and memory-access operations, and if the parallel strategy involves synchronous waiting, the performance loss is substantial. For a processor with a SIMD architecture, when the multiply-accumulate (MAC) iteration unit cannot be matched to the operator kernel, a large amount of compute capacity sits idle and considerable work is spent stitching together blocks of data in memory, so hardware-resource utilization is low and data-access overhead is high.
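As a concrete baseline for the cost described above, a naive GEMM makes the iteration volume explicit: every one of the M×N outputs needs K multiply-accumulate steps, and it is exactly these MACs that must be mapped onto SIMD lanes. The sketch below is illustrative and not taken from the patent:

```python
def naive_gemm(A, B):
    """Reference C = A x B. Each of the M*N outputs takes K multiply-
    accumulate (MAC) steps; a SIMD processor must map these MACs onto
    its ALU lanes, and a mismatch leaves lanes idle."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):
                C[m][n] += A[m][k] * B[k][n]
    return C

# 2x2 example: 2*2*2 = 8 MACs in total
print(naive_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```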
No effective solution has yet been proposed for the problems of low hardware-resource utilization and high data-access overhead in existing general matrix multiplication computation.
Disclosure of Invention
This embodiment provides a general matrix multiplication computation optimization method, device, and processor, so as to solve the problems in the related art of low hardware-resource utilization and high data-access overhead in general matrix multiplication computation.
In a first aspect, in this embodiment, there is provided a general matrix multiplication computation optimization method, the method being applied to a processor including at least one computation core including an arithmetic logic unit, a data cache, and registers, the method including:
determining the size of a general matrix multiplier kernel based on the width of the arithmetic logic unit, the number of registers, the capacity of the data cache, and a predetermined number of computational cores for constructing the general matrix multiplier kernel;
optimizing the number of computing cores of parallel computing based on the size of the kernel of the universal matrix multiplication operator, the size of a predetermined basic block matrix, and the sizes of a left matrix and a right matrix to be subjected to universal matrix multiplication;
and optimizing the block calculation of the general matrix multiplication calculation area in the data cache based on the number of the calculation cores of the parallel calculation, the size of the basic block matrix and the sizes of the left matrix and the right matrix.
In some embodiments, optimizing the block computation of the general matrix multiplication computation area in the data cache based on the number of computation cores of the parallel computation, the size of the basic block matrix, and the sizes of the left matrix and the right matrix includes:
determining the iteration times of the computing cores, the initial block number of the non-reduction dimension of the computing area, and the iteration times of the initial blocks, based on the sizes of the left matrix, the right matrix, and the basic block matrix and the number of the computing cores of the parallel computing;
and under the condition that the iteration times of the computing cores are larger than the iteration times of the initial blocks, carrying out combination computation and split computation of the reduction dimensions on the initial blocks to form target blocks, wherein the iteration times of the target blocks are equal to the iteration times of the computing cores.
In some of these embodiments, performing a merging calculation and splitting calculation of the reduction dimension on the initial partition to form a target partition includes:
sequentially performing iterative computation on each initial block based on the arrangement sequence of each initial block in the general matrix multiplication computation area;
when the initial blocks are combined and calculated, accumulating iterative calculation values of at least two combined initial blocks;
and when the initial block is split, accumulating the iterative calculation value of the first split block of the initial block with the iterative calculation value of the former initial block, and accumulating the iterative calculation value of the second split block with the iterative calculation value of the latter initial block.
In some of these embodiments, the determining the number of iterations of computing cores, the number of initial partitions of the computing region non-reduction dimension, and the number of iterations of initial partitions based on the sizes of the left matrix, the right matrix, and the basic block matrix, and the number of computing cores of the parallel computing includes:
determining the initial block number, the iteration number and the total iteration number of the initial block of the non-reduction dimension of the calculation region based on the sizes of the left matrix, the right matrix and the basic block matrix;
and determining the iteration times of the computation cores based on the number of the computation cores of the parallel computation and the total iteration times.
In some of these embodiments, optimizing the number of computational cores for parallel computation based on the size of the generic matrix multiplier kernel, the size of a predetermined basic block matrix, and the sizes of the left and right matrices to be subjected to generic matrix multiplication includes:
acquiring total iteration times based on the sizes of the left matrix, the right matrix and the basic block matrix;
and optimizing the number of the preset parallel computing cores based on the total iteration times, the reduction dimension of the universal matrix multiplying operator kernel and the reduction dimension of the basic block matrix to obtain an optimal value of the number of the computing cores.
In some embodiments, optimizing the preset number of computing cores of the parallel computing based on the total iteration times, the reduction dimension of the universal matrix multiplier kernel, and the reduction dimension of the basic block matrix to obtain the optimal value of the number of computing cores includes:
determining the iteration times of the computing cores based on the total iteration times and the preset number of the computing cores of parallel computing;
determining whether the iteration times of the computing core meet preset constraint conditions or not, wherein the constraint conditions are determined by the dimension of the general matrix multiplying operator kernel and the dimension of the basic block matrix;
and under the condition that the constraint condition is not met, adjusting the number of the computing cores of the parallel computing until the iteration times of the corresponding computing cores meet the constraint condition.
In some of these embodiments, the constraint condition is expressed by a formula (rendered as an image in the original publication) relating the following quantities: the iteration times of a computing core; the reduction dimension of the basic block matrix; the reduction dimension of the universal matrix multiplication operator kernel; a predetermined scaling factor c; and n, a value determined by the number of blocks of the calculation region along that dimension. The value range of n is likewise given by a formula in the original publication.
In some of these embodiments, the determining the size of the generic matrix multiplier kernel based on the width of the arithmetic logic unit, the number of registers, the capacity of the data cache, and a predetermined number of computational cores to construct the generic matrix multiplier kernel comprises:
determining the data quantity processed by the computing core in each iteration based on the width of the arithmetic logic unit and the number of registers;
determining the non-reduced dimension of the universal matrix multiplier kernel based on the number of computing cores and the data volume processed by each iteration of the computing cores;
and determining the reduction dimension of the universal matrix multiplication operator kernel based on the capacity of the data cache corresponding to the left matrix and the right matrix and the non-reduction dimension.
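The kernel-size determination just described (per-iteration data volume from ALU width and registers, non-reduce dimensions from the core count, reduce dimension from the cache capacity) might be sketched as follows. The tiling assumptions here (one accumulator per result row, cores arranged in a square grid, the cache split between one A panel and one B panel) are ours, not the patent's:

```python
import math

def kernel_size(alu_width, n_acc_regs, cache_elems, n_cores):
    # Each core's tile: n_acc_regs rows (one accumulator per row) by
    # alu_width columns (one SIMD lane per column) -- an assumption.
    grid = math.isqrt(n_cores)          # arrange the cores in a square grid
    mc = grid * n_acc_regs              # non-reduce dim M of the combined kernel
    nc = grid * alu_width               # non-reduce dim N of the combined kernel
    # Largest Kc such that the A panel (mc x kc) and the B panel (kc x nc)
    # both fit in the data cache at once.
    kc = cache_elems // (mc + nc)
    return mc, nc, kc

# 4 cores, 8-wide ALU, 4 accumulation registers, cache for 384 elements
print(kernel_size(8, 4, 384, 4))  # (8, 16, 16)
```

Under these assumptions a larger cache directly enlarges Kc, which is consistent with the text's claim that a bigger operator kernel means fewer memory accesses.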
In a second aspect, in this embodiment, there is provided a general matrix multiplication computation optimization apparatus, the apparatus being applied to a processor including at least one computation core including an arithmetic logic unit, a data cache, and registers, the apparatus comprising:
a first determining module, configured to determine a size of a universal matrix multiplier kernel based on a width of the arithmetic logic unit, a number of registers, a capacity of the data cache, and a predetermined number of computation cores for constituting the universal matrix multiplier kernel;
The first optimizing module is used for optimizing the number of calculation cores of parallel calculation based on the size of the kernel of the universal matrix multiplication operator, the size of a predetermined basic block matrix, and the sizes of a left matrix and a right matrix to be subjected to universal matrix multiplication;
and the second optimization module is used for optimizing the block calculation of the general matrix multiplication calculation area in the data cache based on the number of the calculation cores of the parallel calculation, the size of the basic block matrix and the sizes of the left matrix and the right matrix.
In a third aspect, in this embodiment, there is provided a processor including the general matrix multiplication computation optimization device according to the second aspect, and at least one computation core including an arithmetic logic unit, a data cache, and registers.
Compared with the related art, the general matrix multiplication optimization method provided in this embodiment determines the size of the general matrix multiplication operator kernel from the width of the arithmetic logic unit of a processor compute core, the number of registers, the capacity of the data cache, and a predetermined number of compute cores used to form the operator kernel; this logically widens the arithmetic logic unit and, by combining compute resources, raises the utilization of hardware compute resources. It optimizes the number of compute cores used for parallel computation from the operator-kernel size, the size of a predetermined basic block matrix, and the sizes of the left and right matrices to be multiplied; this ensures that the multiply-accumulate process on each parallel compute core matches the reduce dimension of the combined operator kernel, so that each iteration runs the operator kernel at full load along the reduce dimension, reducing the fragmented computation and memory-access cost caused by partitioning. Finally, it optimizes the block computation of the general matrix multiplication computation area in the data cache from the number of parallel compute cores, the basic-block-matrix size, and the left- and right-matrix sizes; this improves compute-core throughput, avoids wasting compute resources, and improves the access efficiency of the data cache and registers, thereby solving the related-art problems of low hardware-resource utilization and high data-access overhead in general matrix multiplication computation.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below; other features, objects, and advantages of the application will become apparent from them.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a processor computational core architecture of a general matrix multiplication computation optimization method according to some embodiments of the present application;
FIG. 2 is a flow chart of a generic matrix multiplication computation optimization method of some embodiments of the application;
FIG. 3 is a schematic diagram of a generic matrix multiplication computation process according to some embodiments of the application;
FIG. 4 is a flow chart of a block calculation optimization of a generic matrix multiplication calculation region in accordance with some embodiments of the application;
FIG. 5 is an initial block diagram of a generic matrix multiplication computation area according to some embodiments of the application;
FIG. 6 is a schematic diagram of a target block for a generic matrix multiplication computation area in accordance with some embodiments of the application;
FIG. 7 is a flow chart of merging and splitting initial partitions to form target partitions according to some embodiments of the application;
FIG. 8 is a flow chart of a method of optimizing the number of compute cores of a parallel computation in accordance with some embodiments of the application;
FIG. 9 is a flow chart of obtaining an optimal value for the number of compute cores in accordance with some embodiments of the application;
FIG. 10 is a flow chart of determining a generic matrix multiplier kernel size according to some embodiments of the present application;
FIG. 11 is a flow chart of a generic matrix multiplication computation optimization method in accordance with some preferred embodiments of the present application;
FIG. 12 is a block diagram of a generic matrix multiply computation optimization apparatus according to some embodiments of the application.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples for a clearer understanding of the objects, technical solutions and advantages of the present application. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Unless defined otherwise, technical or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these" and similar terms in this application are not intended to be limiting in number, but may be singular or plural. The terms "comprising," "including," "having," and any variations thereof, as used herein, are intended to encompass non-exclusive inclusion; for example, a process, method, and system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the list of steps or modules (units), but may include other steps or modules (units) not listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this disclosure are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. Typically, the character "/" indicates that the associated object is an "or" relationship. The terms "first," "second," "third," and the like, as referred to in this disclosure, merely distinguish similar objects and do not represent a particular ordering for objects.
The general matrix multiplication computation optimization method provided by the embodiments of the application can be executed in a processor of a server, a computer, a terminal, or a similar computing device. FIG. 1 is a schematic diagram of a processor computational core architecture of a general matrix multiplication computation optimization method according to some embodiments of the present application. As shown in fig. 1, the processor includes 4 computing cores 11, each computing core 11 comprising an arithmetic logic unit, a data cache, an instruction cache, and registers. The arithmetic logic unit is 8 wide, the total number of registers is 12, and the number of accumulation registers after compiler optimization is 4. For example, among the 4 computing cores 11 shown in fig. 1, the accumulation registers of the lower-right computing core 11 are registers 0, 4, 7, and a. An ALU width of 8 means that 8 data elements can be processed at a time; the data may be floating-point numbers, and the bit width of each floating-point number is determined by the hardware of the computer, e.g., 32 or 64 bits. For general matrix multiplication, an 8-wide arithmetic logic unit can thus work through an 8×8 matrix product in parallel. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is merely illustrative and does not limit the configuration of the processor described above. For example, the processor may include more or fewer compute cores than shown in FIG. 1, or the registers and arithmetic logic units in the compute cores may be configured differently from FIG. 1.
The computer on which the processor is located may further include a memory for storing data, and may further include a transmission device for a communication function and an input-output device. The memory may be used to store a computer program, such as a computer program corresponding to the general matrix multiplication computation optimization method in the present embodiment, and by running the computer program stored in the memory, various functional applications and data processing are executed, that is, the method described above is implemented. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. In some embodiments, the memory may further include memory remotely located with respect to the processor, which may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In this embodiment, a general matrix multiplication optimization method is provided, and fig. 2 is a flowchart of the general matrix multiplication optimization method according to some embodiments of the present application, as shown in fig. 2, where the flowchart includes the following steps:
Step S201, determining the size of the universal matrix multiplier kernel based on the width of the arithmetic logic unit, the number of registers, the capacity of the data cache, and the predetermined number of computation cores for constituting the universal matrix multiplier kernel.
Operators are the different types of computational functions that make up a deep learning algorithm, such as the convolution operator in a convolution layer, the weighted-sum operator in a fully connected layer, the ReLU operator, and so on. The computational function used for general matrix multiplication may be called the general matrix multiplication operator. Consider the matrix multiplication C(M, N) = A(M, K) × B(K, N): matrix C (M rows, N columns) is the product of the left matrix A (M rows, K columns) and the right matrix B (K rows, N columns). The size of the general matrix multiplication operator kernel is the maximum amount of data of matrices A and B that the processor can read from memory into its registers at a time; it comprises the non-reduce dimensions (those of M and N) and the reduce dimension (that of K) of the operator kernel. For example, suppose the operator kernel size is Mc = 16, Kc = 8, Nc = 16; the processor can then read at most 16×8 elements of matrix A and 8×16 elements of matrix B from memory at a time. The larger the operator kernel, the fewer memory accesses are needed to complete the whole general matrix multiplication computation.
When performing matrix multiplication with the general matrix multiplication operator, fig. 3 is a schematic diagram of the computation process according to some embodiments of the present application. As shown in fig. 3, assume the predetermined number of computation cores forming the operator kernel is 4, so the operator kernel is formed by combining the computation on 4 computing cores; the left side of fig. 3 is the kernel result matrix, and the right side shows the kernel sizes and computation directions of the left and right matrices, respectively. The combined operator kernel is assumed to take the form Ck = Ak × Bk, where Ck is an Mc×Nc matrix block, Ak is an Mc×Kc matrix block, and Bk is a Kc×Nc matrix block. Step S201 determines the operator-kernel size with the highest resource utilization, i.e., the values of Mc, Nc, and Kc.
Step S202, optimizing the number of computing cores of parallel computing based on the size of the kernel of the universal matrix multiplication operator, the size of the predetermined basic block matrix, and the sizes of the left matrix and the right matrix to be subjected to universal matrix multiplication.
Normally, performing a matrix multiplication requires 2×M×N×K read operations and M×N write operations against memory; when matrices A and B are large, the sheer number of memory accesses makes general matrix multiplication inefficient. In this case, block computation can be used: matrices A and B and the result matrix are divided into basic block matrices, and the blocks to be computed are transferred from memory to the data cache. During computation the processor reads data directly from the data cache, which raises read speed and reduces the number of memory accesses. The size of the basic block matrix is the size of each small block after A, B, and the result matrix are partitioned, and may be denoted Mb, Kb, and Nb. Taking an Mb×Kb basic block of the left matrix A and a Kb×Nb basic block of the right matrix B, placing them in the data cache, and computing on them is called one iteration.
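The access-count argument above can be made concrete with a small model. The blocked count below assumes each iteration loads exactly one Mb×Kb panel of A and one Kb×Nb panel of B into the cache, which is an illustrative simplification rather than the patent's exact accounting:

```python
import math

def gemm_traffic(M, N, K, Mb, Nb, Kb):
    naive_reads = 2 * M * N * K                    # two operand reads per MAC
    iters = (math.ceil(M / Mb) * math.ceil(N / Nb)
             * math.ceil(K / Kb))                  # number of basic-block iterations
    blocked_reads = iters * (Mb * Kb + Kb * Nb)    # one A panel + one B panel each
    return naive_reads, iters, blocked_reads

print(gemm_traffic(64, 64, 64, 16, 16, 16))  # (524288, 64, 32768)
```

Even for this small 64×64×64 problem, staging 16×16 blocks in the cache cuts memory reads by a factor of 16 in this model.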
On the other hand, the iterations can be shared among several computing cores working in parallel, which improves computational efficiency. In practice, an initial value for the number of parallel computing cores is usually set, but this initial value is not necessarily the optimum that maximizes computation performance. With performance maximization in mind, the optimization proceeds as follows: from the sizes of the left and right matrices of the general matrix multiplication to be executed and the size of the basic block matrix, obtain the total number of iterations and the number of iterations of each computing core; from these iteration counts, the size of the general matrix multiplication operator kernel, and the size of the basic block matrix, determine whether the initial core count is the optimal value; if it is not, adjust the initial value using these parameters until the optimal number of parallel computing cores is obtained.
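The adjustment loop described above might look like the following sketch. The concrete full-load test used here (the per-core iteration count t times Kb being a multiple of Kc) is an illustrative stand-in for the patent's constraint formula, and the downward search direction is likewise an assumption:

```python
import math

def tune_core_count(total_iters, init_cores, kc, kb):
    """Shrink the parallel core count from its preset initial value until the
    per-core iteration count t keeps the operator kernel's reduce dimension
    fully loaded (here: t * kb is a multiple of kc -- an assumed criterion)."""
    cores = init_cores
    while cores > 1:
        t = math.ceil(total_iters / cores)   # iterations each core must run
        if (t * kb) % kc == 0:               # reduce dim fully loaded each pass
            return cores, t
        cores -= 1
    return 1, total_iters

# 64 total iterations, preset 5 cores, kernel reduce dim 8, block reduce dim 4
print(tune_core_count(64, 5, 8, 4))  # (4, 16)
```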
Step S203, optimizing the block calculation of the general matrix multiplication calculation area in the data cache based on the number of calculation cores of the parallel calculation, the size of the basic block matrix, and the sizes of the left matrix and the right matrix.
The block computation of the general matrix multiplication over matrices A and B is then optimized on the basis of the optimized number of parallel computing cores obtained in step S202. A common blocking scheme for general matrix multiplication partitions along the non-reduce dimensions (those of M and N) according to the size of the basic block matrix. Such a scheme is not necessarily matched to the capacity of the data cache and the number of parallel computing cores; it often limits hardware-resource utilization and may be unable to complete the whole computation in one parallel pass. Since the number of parallel computing cores has already been optimized in this embodiment, the original blocking scheme is further optimized according to that core count, the sizes of the left matrix A and right matrix B, and the size of the basic block matrix: on top of the original scheme, the original blocks are split and merged along the accumulation (reduce) dimension of the matrix multiplication to form new blocks, so that the entire computation is completed in one parallel pass.
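The split-and-merge step along the reduce dimension can be pictured as re-tiling a list of per-block iteration counts so that every target block carries exactly the per-core iteration count. The helper below is a hypothetical illustration of that idea, not the patent's algorithm:

```python
def retile_reduce_dim(block_iters, target_iters):
    """Merge small initial blocks and split large ones along the reduce
    dimension so each target block needs exactly target_iters iterations;
    partial work at a boundary is accumulated into the neighbouring block."""
    targets, acc = [], 0
    for it in block_iters:
        while acc + it >= target_iters:       # enough work: split off a target block
            it -= target_iters - acc
            targets.append(target_iters)
            acc = 0
        acc += it                             # remainder merges with the next block
    if acc:
        targets.append(acc)                   # trailing partial block, if any
    return targets

# three uneven initial blocks re-tiled to a per-core iteration count of 4
print(retile_reduce_dim([3, 5, 4], 4))  # [4, 4, 4]
```

Note that the total iteration count is preserved; only the block boundaries move, matching the accumulation-based merging and splitting described in the claims.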
Through steps S201-S203, the size of the general matrix multiplication operator kernel is determined from the width of the arithmetic logic unit of the processor's computing core, the number of registers, the capacity of the data cache, and the predetermined number of computing cores used to form the operator kernel, logically widening the arithmetic logic unit and improving the utilization of hardware computing resources by combining them. The number of parallel computing cores is then optimized from the size of the operator kernel, the size of the predetermined basic block matrix, and the sizes of the left and right matrices to be multiplied, ensuring that the multiply-accumulate process on each parallel computing core matches the reduction dimension of the combined operator kernel; full-load computation of the operator kernel along the reduction dimension can thus be achieved in every iteration, reducing the fragmented computation and memory-access consumption caused by splitting. Finally, the block computation of the general matrix multiplication computation region in the data cache is optimized from the number of parallel computing cores, the size of the basic block matrix, and the sizes of the left and right matrices, improving the throughput of the computing cores, avoiding wasted computing resources, and improving the access efficiency of the data cache and registers. This solves the problems in the related art of low hardware-resource utilization and high data-access cost in general matrix multiplication.
In some embodiments, FIG. 4 is a flow chart of block calculation optimization of a generic matrix multiplication computation area of some embodiments of the present application, as shown in FIG. 4, the flow comprising the steps of:
in step S401, the number of iterations of the computation cores, the number of initial blocks of the computation area non-reduction dimension, and the number of iterations of the initial blocks are determined based on the sizes of the left matrix, the right matrix, and the basic block matrix, and the number of computation cores computed in parallel.
The computation region can be split into initial blocks according to the three-dimensional sizes of the basic block matrix and of the left and right matrices to be computed, taking the non-reduced dimensions M and N as the splitting dimensions. Since each basic block matrix corresponds to exactly one iteration, the total number of iterations, the number of initial blocks, and the number of iterations per initial block can be obtained from the number of initial blocks produced by the split and the number of basic block matrices within each initial block; the number of iterations of each computing core then follows from the number of parallel computing cores.
Specifically, the process may include the steps of:
step S4011, determining an initial block number, an iteration number of the initial block, and a total iteration number of the calculation region non-reduction dimension based on the sizes of the left matrix, the right matrix, and the basic block matrix.
The result matrix C (M rows, N columns) of the general matrix multiplication is obtained by multiplying a left matrix A (M rows, K columns) by a right matrix B (K rows, N columns), where the basic block matrix corresponding to the left matrix A has dimensions $m_b \times k_b$ and the basic block matrix corresponding to the right matrix B has dimensions $k_b \times n_b$. The total number of iterations of the general matrix multiplication can be derived from the size of the basic block matrix in the three dimensions M, N, K by:

$$T = \left\lceil \frac{M}{m_b} \right\rceil \cdot \left\lceil \frac{N}{n_b} \right\rceil \cdot \left\lceil \frac{K}{k_b} \right\rceil$$

where $T$ is the total number of iterations and $\lceil \cdot \rceil$ denotes the rounding-up (ceiling) operation.
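As a quick illustration (a sketch only, using the example sizes that appear in this embodiment: M = N = 384, K = 128, $m_b$ = $n_b$ = 128, $k_b$ = 4), the total iteration count can be computed as:

```python
import math

def total_iterations(M, N, K, mb, nb, kb):
    # T = ceil(M/mb) * ceil(N/nb) * ceil(K/kb)
    return math.ceil(M / mb) * math.ceil(N / nb) * math.ceil(K / kb)

print(total_iterations(384, 384, 128, 128, 128, 4))  # -> 288 (3 * 3 * 32)
```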
Fig. 5 is an initial blocking diagram of a general matrix multiplication computation region of some embodiments of the present application. Assume M = 384, N = 384, K = 128, and a basic block matrix with $m_b$ = 128, $n_b$ = 128, $k_b$ = 4. As shown in fig. 5, the computation region of the general matrix multiplication is divided into 384/128 = 3 parts in each of the non-reduced dimensions M and N, and into 128/4 = 32 parts in the K dimension (not shown). Viewed in the M and N dimensions, the result matrix is divided into 9 initial blocks, numbered 0-8, for small-scale general matrix multiplication followed by memory rearrangement of the data; if distributed evenly, 9 mutually independent computing cores would be needed to process them simultaneously and complete the computation in one pass. In practical applications, however, the number of computing cores performing block-parallel computation may be smaller than the number of initial blocks, so the computation cannot be completed in one pass.
From the above formula, the total number of iterations is 3 × 3 × 32 = 288, and each initial block takes 32 iterations.
Step S4012, determining the number of iterations of the computational cores based on the number of computational cores and the total number of iterations of the parallel computation.
Assuming the optimized number of parallel computing cores is 4, each computing core performs 288/4 = 72 iterations. Since the number of parallel computing cores (4) is smaller than the number of initial blocks (9), the initial blocking scheme cannot complete the entire computation in one parallel pass.
In step S402, in the case that the number of iterations of the computing core is greater than the number of iterations of the initial block, the merging computation and the splitting computation of the reduction dimension are performed on the initial block to form a target block, where the number of iterations of the target block is equal to the number of iterations of the computing core.
If the number of iterations of the computing cores is greater than the number of iterations of the initial blocks, i.e., the optimized number of parallel computing cores is smaller than the number of initial blocks, this embodiment optimizes the initial blocking scheme by merging and splitting the initial blocks along the reduction dimension to form target blocks, so that the number of iterations of each target block equals the number of iterations of each computing core, i.e., the number of target blocks equals the number of parallel computing cores.
Still taking the example above, fig. 6 is a schematic diagram of target blocks of a general matrix multiplication computation region according to some embodiments of the present application. As shown in fig. 6, the optimized parallel computing cores are g1 to g4, and the target block corresponding to each computing core is shown as a black bold box. For example, the target block corresponding to computing core g1 includes initial block 0, initial block 1, and part of initial block 2; the target block corresponding to computing core g2 includes part of initial block 2, initial block 3, and part of initial block 4, and so on. Note that the initial blocks in fig. 6 are split not in the M and N dimensions but in the reduction dimension K (the K-dimension split is not shown in fig. 5). The differing areas of initial blocks 2, 4, and 6 after splitting in fig. 6 can be read as differing numbers of basic blocks split along the K dimension, i.e., differing iteration counts.
Taking the target block corresponding to computing core g1 as an example: initial blocks 0 and 1 are each divided along the K dimension into 32 iterations, 64 iterations in total for the two blocks; another 8 iterations are split off from the K-dimension blocks of initial block 2, for a total of 72 iterations, which satisfies the iteration count of computing core g1. By analogy, initial blocks 0-8 are split and merged in sequence, so that each computing core is assigned a target block whose iteration count equals its own; that is, the number of iterations of each target block equals that of its computing core.
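The merge/split bookkeeping described above can be sketched as follows (an illustration only, assuming the total iteration count divides evenly by the core count, as in this example):

```python
def target_blocks(M, N, K, mb, nb, kb, g):
    """Map each core's contiguous iteration range back to (initial block,
    K-slice range) triples, merging whole blocks and splitting others."""
    per_block = -(-K // kb)                    # iterations per initial block
    blocks = -(-M // mb) * (-(-N // nb))       # initial blocks over M, N
    t = blocks * per_block // g                # iterations per core (assumed exact)
    cover = []
    for core in range(g):
        start, end = core * t, (core + 1) * t
        spans = []
        for b in range(start // per_block, (end - 1) // per_block + 1):
            lo = max(start, b * per_block) - b * per_block
            hi = min(end, (b + 1) * per_block) - b * per_block
            spans.append((b, lo, hi))
        cover.append(spans)
    return cover

print(target_blocks(384, 384, 128, 128, 128, 4, 4)[0])
# -> [(0, 0, 32), (1, 0, 32), (2, 0, 8)]  (core g1: blocks 0, 1 and part of 2)
```

The second core's spans come out as part of block 2, all of block 3, and part of block 4, matching the description of fig. 6.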
This parallel execution strategy of fixing the split of the reduction dimension makes the global memory-access overhead of the general matrix multiplication independent of the matrix scale; it grows only with the number of reduction-dimension splits. Moreover, when the number of initial blocks is greater than the number of parallel computing cores, each initial block is covered by the computation results of several computing cores, so the computation and accumulation can be processed in pipelined fashion (the number of computing cores in the operator kernel being smaller than the number of parallel computing cores), and the synchronous waiting time on the same initial block can be eliminated entirely.
Through steps S401-S402, the iteration count of the computing cores, the number of initial blocks of the computation region, and the iteration count of the initial blocks are determined from the sizes of the left matrix, the right matrix, and the basic block matrix and from the number of parallel computing cores, yielding an initial blocking scheme for the result matrix and a determination of whether that scheme can complete the entire computation in one parallel pass. When the iteration count of the computing cores is greater than that of the initial blocks, i.e., when the initial blocking scheme does not meet the requirement, the initial blocks are merged and re-split along the reduction dimension to form target blocks, so that the entire computation is completed in one parallel pass, improving hardware resource utilization and computing-core throughput.
In some embodiments, FIG. 7 is a flow chart of merging and splitting initial partitions to form target partitions according to some embodiments of the application, as shown in FIG. 7, the flow comprising the steps of:
step S701, sequentially performing iterative computation on each initial block based on the arrangement order of each initial block in the general matrix multiplication computation area.
In a concrete program implementation, the outermost loop of the general matrix multiplication algorithm can be unrolled and distributed across the four computing cores g1-g4. The iterative computation of each computing core within each initial block is marked by the block index of the result matrix, and the start and end indices of the actual multiply-accumulate computation serve as flags indicating whether an initial block holds a partial sum. Iterative computation is executed in the arrangement order of initial blocks 0-8 over the whole computation region, and the start and end positions of each target block are determined from the start and end indices of the computing core.
Step S702, when performing the merging calculation on the initial blocks, accumulating the iterative calculation values of at least two initial blocks that are merged.
Since the number of iterations per computing core is greater than the number of iterations per initial block, the partial sums of at least two adjacent merged initial blocks must be accumulated; that is, an additional number of memory accesses equal to the core count minus one is needed to handle the partial-sum accumulation across initial blocks.
In step S703, when the initial block is split, the iteration calculation value of the first split block of the initial block is accumulated with the iteration calculation value of the previous initial block, and the iteration calculation value of the second split block is accumulated with the iteration calculation value of the next initial block.
The initial blocks that need to be split are determined from the start and end positions of the target blocks; any initial block is split into at most two sub-blocks. Following the order of the K dimension, the iterative computation value of the first sub-block is accumulated with that of the previous initial block to obtain the accumulation result of the preceding computing core, and the iterative computation value of the second sub-block is accumulated with that of the next initial block to obtain the accumulation result of the following computing core.
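Steps S701-S703 can be put together in a small sketch (pure Python, assuming all dimensions divide evenly and writing the basic-block multiply out naively; in the patent's scheme each `core` loop would run on its own computing core):

```python
def parallel_blocked_gemm(A, B, mb, nb, kb, g):
    M, K, N = len(A), len(B), len(B[0])
    cols = N // nb                             # initial-block columns over N
    per_block = K // kb                        # iterations per initial block
    t = (M // mb) * cols * per_block // g      # iterations per core
    C = [[0] * N for _ in range(M)]
    for core in range(g):                      # each run is one core's target block
        for it in range(core * t, (core + 1) * t):
            b, kk = divmod(it, per_block)      # initial block index, K-slice index
            i0, j0, k0 = (b // cols) * mb, (b % cols) * nb, kk * kb
            for i in range(i0, i0 + mb):       # basic-block multiply-accumulate;
                for j in range(j0, j0 + nb):   # partial sums across K-slices and
                    C[i][j] += sum(A[i][k] * B[k][j]
                                   for k in range(k0, k0 + kb))
    return C
```

Because every iteration accumulates its partial sum directly into the shared result, merged initial blocks and split K-slices need no special cases in this sketch; on real hardware the cross-block accumulation is where the extra memory accesses described in step S702 arise.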
Through steps S701-S703, iterative computation is performed on the initial blocks in their arrangement order within the general matrix multiplication computation region, yielding the computation result of each basic block matrix, providing the base data for multiply-accumulation, and marking the start and end positions of the target blocks by index. When initial blocks are merged, the iterative computation values of the merged blocks are accumulated; when an initial block is split, the iterative computation value of its first sub-block is accumulated with that of the previous initial block and the value of its second sub-block with that of the next initial block. This realizes a splitting strategy whose number of parts equals the number of computing cores, together with multiply-accumulate computation based on the target blocks; total memory consumption is controlled globally, the linear mapping of matrix shapes onto storage is completed through hyper-parameters, and the operator-performance degradation caused by fluctuations in cache-access consumption due to differing matrix shapes is avoided.
In some embodiments, FIG. 8 is a flow chart of a method of optimizing the number of compute cores for parallel computing according to some embodiments of the application, as shown in FIG. 8, the flow comprising the steps of:
step S801, based on the sizes of the left matrix, the right matrix, and the size of the basic block matrix, the total iteration number is obtained.
According to the calculation formula in the foregoing embodiment, the total iteration number of the universal matrix multiplication may be calculated according to the sizes of the left matrix, the right matrix, and the size of the basic block matrix.
Step S802, optimizing the preset number of parallel computing cores based on the total number of iterations, the reduction dimension of the general matrix multiplication operator kernel, and the reduction dimension of the basic block matrix, to obtain an optimal value of the number of computing cores.
The number of parallel computing cores is the main parameter for optimizing the initial blocking of the computation region: it determines the number of target blocks, the iteration count of each target block, and which basic block matrices are multiplied and accumulated at which positions. The multiply-accumulate process of each computing core must match the reduction dimension of the operator kernel; otherwise the splitting of the reduction dimension may cause fragmented computation and excessive memory-access consumption. It is therefore determined whether the preset number of parallel computing cores matches the reduction dimension of the general matrix multiplication operator kernel; if not, the core count is optimized according to the reduction dimension of the operator kernel, the reduction dimension of the basic block matrix, and the total iteration count, until an optimal core count matching the kernel's reduction dimension is obtained.
Through steps S801-S802, the total number of iterations is obtained from the sizes of the left matrix, the right matrix, and the basic block matrix, and serves as a parameter for judging whether the number of parallel computing cores matches the reduction dimension of the operator kernel. The preset number of parallel computing cores is then optimized from the total iteration count, the reduction dimension of the general matrix multiplication operator kernel, and the reduction dimension of the basic block matrix to obtain an optimal core count, so that the multiply-accumulate process on each computing core matches the kernel's reduction dimension; that is, both the partial sums produced across iterations and the accumulated K dimension are integer multiples of the kernel's reduction dimension, so full-load computation of the operator kernel along the reduction dimension is achieved in every iteration, greatly reducing the fragmented computation and memory-access consumption caused by splitting the K dimension and improving operator performance.
In some embodiments, FIG. 9 is a flowchart of obtaining an optimal value for the number of compute cores according to some embodiments of the application, as shown in FIG. 9, the flowchart including the steps of:
in step S901, the number of iterations of the computation cores is determined based on the total number of iterations and the number of computation cores of the parallel computation set in advance.
According to the total number of iterations $T$ and the preset number $g$ of parallel computing cores, the number of iterations of each computing core is determined by $t = T / g$.
In step S902, it is determined whether the number of iterations of the computation core satisfies a preset constraint condition, where the constraint condition is determined by a reduction dimension of the universal matrix multiplier kernel and a reduction dimension of the basic block matrix.
When the computation region is split along the reduction dimension, the total number of iterations is an integer multiple of the number of basic block matrices split along that dimension. To maximize operator performance, the reduction length accumulated by each computing core must be an integer multiple of the reduction dimension of the operator kernel. A constraint condition is therefore set from the reduction dimension of the general matrix multiplication operator kernel and the reduction dimension of the basic block matrix, and it is determined whether the iteration count of the computing cores satisfies it. If so, the number of computing cores is the optimal value; if not, the number of computing cores is adjusted, which in turn changes the iteration count, until the constraint is satisfied.
Further, the constraint can be expressed by the following equation:

$$t \cdot k_b = c \cdot n \cdot k_{op}$$

where $t$ is the number of iterations of each computing core, $k_b$ is the reduction dimension of the basic block matrix, $k_{op}$ is the reduction dimension of the general matrix multiplication operator kernel, $c$ is a predetermined scaling factor, and $n$ is the number of blocks into which the computation region is split along the reduction dimension, with

$$1 \le n \le \left\lceil \frac{t \cdot k_b}{k_{op}} \right\rceil$$

where $\lceil \cdot \rceil$ denotes the rounding-up operation; $c = 1$ in this embodiment.
With $t$, $k_b$, $k_{op}$, $n$ and $c$ known, they are substituted into the above equation to verify whether its two sides are equal. Since $n$ ranges over a set of values, each value of $n$ is substituted in turn for verification. If the equation is satisfied, the number of computing cores $g$ is determined to be the optimal value.
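The check can be illustrated with a small helper (a sketch under assumptions: the constraint is tested in its divisibility form, and the kernel reduction dimension of 8 is an invented example value, not one fixed by this embodiment):

```python
import math

def core_count_ok(M, N, K, mb, nb, kb, k_op, g, c=1):
    # total iterations T must divide evenly among g cores, and each core's
    # accumulated reduction length t*kb must be a multiple of c*k_op
    T = math.ceil(M / mb) * math.ceil(N / nb) * math.ceil(K / kb)
    return T % g == 0 and (T // g) * kb % (c * k_op) == 0

print(core_count_ok(384, 384, 128, 128, 128, 4, 8, 4))  # True: 72*4 = 288 = 36*8
print(core_count_ok(384, 384, 128, 128, 128, 4, 8, 5))  # False: 288 % 5 != 0
```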
In step S903, the number of computing cores in parallel computing is adjusted until the number of iterations of the corresponding computing cores satisfies the constraint condition, if the constraint condition is not satisfied.
If the constraint condition is not satisfied, the number of parallel computing cores is adjusted according to the processor's hardware resources, the corresponding iteration count $t$ is recomputed and substituted into the above equation for verification, and this repeats until the constraint is satisfied, yielding the optimal value of the core count.
Through steps S901-S903, the iteration count of the computing cores is determined from the total iteration count and the preset number of parallel computing cores, providing the base data for the subsequent judgment of whether the core count is optimal; whether the core count is optimal is decided by checking whether the iteration count satisfies the preset constraint; and when the constraint is not satisfied, the number of parallel computing cores is adjusted until the corresponding iteration count satisfies it. This provides a computational procedure for optimizing the core count: the optimized count ensures that the multiply-accumulate process on each core matches the reduction dimension of the operator kernel, i.e., the reduction lengths produced across initial blocks and accumulated are always integer multiples of the kernel's reduction dimension, so that full-load computation of the operator kernel along the reduction dimension is achieved in every iteration, greatly reducing the fragmented computation and memory-access consumption caused by reduction-dimension splitting and further improving operator performance.
In some embodiments, FIG. 10 is a flow chart of determining the size of a generic matrix multiplier kernel according to some embodiments of the present application, as shown in FIG. 10, comprising the steps of:
in step S1001, the data amount processed by each iteration of the computation core is determined based on the width of the arithmetic logic unit and the number of registers.
The amount of data a single computing core can process at once is determined by the width of the arithmetic logic unit; the product of that width and the number of registers gives the amount of data processed by a single iteration of a single computing core. In one iteration, the computing core accumulates data using several registers and completes one matrix multiplication iteration.
Step S1002, determining the non-reduced dimension of the universal matrix multiplier kernel based on the number of computing cores and the data amount processed by each iteration of the computing cores.
The non-reduced dimensions of the combined operator kernel are obtained from the product of the number of computing cores and the amount of data processed by each computing core per iteration. For example, suppose each computing core can use 4 registers in parallel and the arithmetic logic unit width is 8; the maximum computation in one loop iteration is then 4 result matrices of 8×8. If 4 computing cores, each processing 4×8 data per iteration, are combined into the matrix multiplication operator kernel, the non-reduced dimensions of the kernel can be determined as $m_c \times n_c$ = 32×32.
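A minimal sketch of this sizing (the stacking of the s combined cores along the M dimension is an assumed layout; the 32×32 result matches the example above):

```python
def kernel_nonreduced_dims(alu_width, num_regs, s):
    # each core produces num_regs result tiles of alu_width x alu_width per
    # iteration; stacking s such cores (assumed along M) gives the kernel
    return s * alu_width, num_regs * alu_width  # (m_c, n_c)

print(kernel_nonreduced_dims(8, 4, 4))  # -> (32, 32)
```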
If the left matrix is fully loaded on a single computing core (one column of the left matrix in fig. 3), the computation result is one column of the result-matrix kernel (corresponding to one column of the result matrix in fig. 3); similarly, if the right matrix is fully loaded on a single core (one row of the right matrix in fig. 3), the result is one row of the result-matrix kernel (corresponding to one row of the result matrix in fig. 3). Which loading mode to choose depends on whether the matrix is stored in column-major or row-major order; choosing accordingly avoids a large number of memory-rearrangement operations, reduces the number of memory accesses, and improves operator performance.
In step S1003, the reduction dimension of the universal matrix multiplication operator kernel is determined based on the capacity of the data buffer corresponding to the left matrix and the right matrix and the non-reduction dimension.
The data cache capacity of a single computing core can be expressed by:

$$Buf = Buf_o + Buf_l + Buf_r$$

where $Buf$ is the total cache capacity available for data, and $Buf_o$, $Buf_l$, $Buf_r$ are the cache capacities allocated to the result matrix, the left matrix, and the right matrix, respectively.

The reduction dimension of the operator kernel can be calculated by:

$$k_{op} = \mathrm{align}\!\left(\frac{s \cdot Buf_l}{m_c}\right) \quad \text{or} \quad k_{op} = \mathrm{align}\!\left(\frac{s \cdot Buf_r}{n_c}\right)$$

where $\mathrm{align}(\cdot)$ is a memory-alignment function that rounds its argument; $m_c$ and $n_c$ are the non-reduced dimensions of the general matrix multiplication operator kernel; $Buf_l$ and $Buf_r$ are the left- and right-matrix cache capacities; $k_{op}$ is the reduction dimension of the operator kernel; and $s$ is the number of computing cores in the operator kernel. The two forms correspond to loading data column-major and row-major, respectively.
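Since the original formula images do not survive extraction, the following is only a plausible rendering of the relationship (capacities taken in matrix elements and align(·) taken as round-down; both are assumptions):

```python
def kernel_reduced_dim(buf_elems, nonreduced_dim, s, align=8):
    # each of the s cores buffers nonreduced_dim/s rows (or columns) of its
    # panel, so the K-depth that fits is s*buf/nonreduced_dim, rounded down
    # to the alignment granularity
    return (buf_elems * s // nonreduced_dim) // align * align

print(kernel_reduced_dim(1024, 32, 4))  # -> 128
```

Passing ($Buf_l$, $m_c$) corresponds to column-major loading, and ($Buf_r$, $n_c$) to row-major loading.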
Through steps S1001-S1003, the amount of data processed per iteration by each computing core is determined from the width of the arithmetic logic unit and the number of registers, yielding the processing-capacity parameter of each core; the non-reduced dimensions of the general matrix multiplication operator kernel are determined from the number of computing cores and that per-iteration data amount; and the reduction dimension of the kernel is determined from the non-reduced dimensions and the cache capacities corresponding to the left and right matrices. The dimensions of the operator kernel are thus fixed, and maximum resource utilization is obtained through the combination of multiple computing cores.
The present embodiment is described and illustrated below by way of preferred embodiments. FIG. 11 is a flow chart of a general matrix multiply computation optimization method in accordance with some preferred embodiments of the present application, the method being applied to a processor including at least one compute core including arithmetic logic units, data caches and registers. As shown in fig. 11, the flow includes the steps of:
Step S1101, determining the non-reduced dimension of the general matrix multiplier kernel based on the product of the width of the arithmetic logic unit in the computation core, the number of registers, and the predetermined number of computation cores for constituting the general matrix multiplier kernel;
step S1102, determining the reduced dimension of the kernel of the universal matrix multiplication operator based on the capacity of the data cache and the non-reduced dimension corresponding to the left matrix and the right matrix;
The reduction dimension of the operator kernel can be calculated by:

$$k_{op} = \mathrm{align}\!\left(\frac{s \cdot Buf_l}{m_c}\right) \quad \text{or} \quad k_{op} = \mathrm{align}\!\left(\frac{s \cdot Buf_r}{n_c}\right)$$

where $\mathrm{align}(\cdot)$ is a memory-alignment function that rounds its argument; $m_c$ and $n_c$ are the non-reduced dimensions of the operator kernel; $Buf_l$ and $Buf_r$ are the left- and right-matrix cache capacities; $k_{op}$ is the reduction dimension of the operator kernel; and $s$ is the number of computing cores in the kernel; the two forms correspond to loading data column-major and row-major, respectively.
Step S1103, obtaining total iteration times based on the sizes of the left matrix, the right matrix and the size of a predetermined basic block matrix;
step S1104, determining the iteration number of each computation core based on the total iteration number and the number of computation cores of the parallel computation set in advance;
Step S1105, determining whether the iteration number satisfies the following constraint:

$$t \cdot k_b = c \cdot n \cdot k_{op}$$

where $t$ is the number of iterations of each computing core, $k_b$ is the reduction dimension of the basic block matrix, $k_{op}$ is the reduction dimension of the general matrix multiplication operator kernel, $c$ is a predetermined scaling factor, and $n$ is the number of blocks into which the computation region is split along the reduction dimension, with $1 \le n \le \lceil t \cdot k_b / k_{op} \rceil$;
step S1106, if the constraint condition is satisfied, determining the number of the computation cores corresponding to the iteration number as an optimal value;
step S1107, if the constraint condition is not met, adjusting the number of the calculation cores until the corresponding iteration times meet the constraint condition, and obtaining an optimal value of the number of the calculation cores;
step S1108, determining the initial block number and the iteration number of the initial block of the non-reduction dimension of the calculation region based on the sizes of the left matrix, the right matrix and the basic block matrix;
step S1109, determining the iteration number of each computing core based on the optimal value of the number of computing cores and the total iteration number;
step S1110, under the condition that the iteration times of the computing core are larger than the iteration times of the initial blocks, sequentially carrying out iterative computation on each initial block and accumulating the iteration times based on the arrangement sequence of each initial block in the general matrix multiplication computation area;
Step S1111, when the accumulated iteration count of the reduction dimension of the initial blocks equals the iteration count of the computing core, accumulating the reduction-dimension iterative computation results of those initial blocks to obtain the iterative computation result of the target block, and restarting the accumulated iteration count;
step S1112, repeating steps S1110-S1111 until all the initial blocks complete the result accumulation of the iterative computation, and obtaining the iterative computation result of each target block.
Through steps S1101-S1112, the size of the operator kernel is determined from the processing-capacity parameter of each computing core and the number of cores combined into the general matrix multiplication operator kernel, obtaining maximum resource utilization through multi-core combination. The constraint equation decides whether a given number of parallel computing cores is optimal, and when it is not, the count is adjusted until an optimal value is found; the optimized core count ensures that the multiply-accumulate process on each core matches the reduction dimension of the operator kernel, greatly reducing the fragmented computation and memory-access consumption caused by reduction-dimension splitting and improving operator performance. By splitting the initial blocks along the reduction dimension, a splitting strategy whose number of parts equals the core count and multiply-accumulate computation based on target blocks are realized; total memory consumption is controlled globally, the linear mapping of matrix shapes onto storage is completed through hyper-parameters, and operator-performance degradation caused by fluctuations in cache-access consumption due to matrix-shape differences is avoided.
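Steps S1103-S1107 can be condensed into a driver sketch (the divisibility form of the constraint, the descending search order for adjustment, and the kernel reduction dimension of 8 are all assumptions for illustration):

```python
import math

def choose_core_count(M, N, K, mb, nb, kb, k_op, g0, g_max, c=1):
    # try the preset count g0 first (step S1104); otherwise adjust (step S1107)
    T = math.ceil(M / mb) * math.ceil(N / nb) * math.ceil(K / kb)
    def ok(g):
        return T % g == 0 and (T // g) * kb % (c * k_op) == 0
    if ok(g0):
        return g0
    for g in range(g_max, 0, -1):  # assumed search order: prefer more cores
        if ok(g):
            return g
    raise ValueError("no feasible core count up to g_max")

print(choose_core_count(384, 384, 128, 128, 128, 4, 8, 4, 16))  # -> 4
```

With the example sizes, a preset of 4 cores already satisfies the constraint (72 iterations per core, 72 × 4 a multiple of 8), so it is returned unchanged; an infeasible preset such as 5 triggers the adjustment search.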
It should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
In some embodiments, the present application also provides a general matrix multiplication computation optimization apparatus applied to a processor that includes at least one computation core, the computation core including an arithmetic logic unit, a data cache, and registers. The apparatus implements the above embodiments and preferred implementations; what has already been described is not repeated here. The terms "module," "unit," "sub-unit," and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. In some embodiments, fig. 12 is a block diagram of the general matrix multiplication computation optimization apparatus of this embodiment; as shown in fig. 12, the apparatus includes:
a determining module 1201, configured to determine the size of the general matrix multiplication operator kernel based on the width of the arithmetic logic unit, the number of registers, the capacity of the data cache, and a predetermined number of computation cores used to form the general matrix multiplication operator kernel;
a first optimizing module 1202, configured to optimize the number of computation cores for parallel computation based on the size of the general matrix multiplication operator kernel, the size of a predetermined basic block matrix, and the sizes of the left matrix and the right matrix to be multiplied;
a second optimizing module 1203, configured to optimize the block computation of the general matrix multiplication computation region in the data cache based on the number of computation cores for parallel computation, the size of the basic block matrix, and the sizes of the left matrix and the right matrix.
In the general matrix multiplication computation optimization apparatus of this embodiment, the determining module 1201 determines the size of the general matrix multiplication operator kernel, logically widening the arithmetic logic unit and improving the utilization of hardware computing resources by combining computing resources; the first optimizing module 1202 optimizes the number of computation cores for parallel computation, ensuring that the multiply-accumulate process on each parallel core matches the reduction dimension of the combined operator kernel, so that full-load computation of the operator kernel along the reduction dimension is achieved in each iteration and the fragmented computation and memory consumption caused by splitting are reduced; the second optimizing module 1203 optimizes the block computation of the general matrix multiplication computation region in the data cache, improving the throughput of the computation cores, avoiding waste of computing resources, and improving the access efficiency of the data cache and registers, thereby solving the problems of low hardware-resource utilization and high data-access cost in general matrix multiplication in the related art.
In some embodiments, the second optimization module includes a first determination submodule and a calculation submodule. The first determination submodule is configured to determine the iteration count of the computation cores, the number of initial blocks of the computation region along the non-reduction dimensions, and the iteration count of the initial blocks, based on the sizes of the left matrix, the right matrix, and the basic block matrix and the number of parallel computation cores; the computation submodule is configured to, when the iteration count of the computation cores is greater than that of the initial blocks, perform combination computation and reduction-dimension splitting on the initial blocks to form target blocks whose iteration count equals that of the computation cores.
In the general matrix multiplication computation optimization apparatus of this embodiment, the first determination submodule determines the iteration count of the computation cores, the number of initial blocks of the computation region, and the iteration count of the initial blocks, yielding an initial blocking scheme for the result matrix and determining whether that scheme can complete the entire computation in a single parallel pass; when the iteration count of the computation cores is greater than that of the initial blocks, the computation submodule performs combination and splitting on the initial blocks to form target blocks. That is, when the initial blocking scheme does not meet the requirement, the initial blocks are re-split along the reduction dimension to obtain a target blocking scheme under which all computation is completed in one parallel pass, improving hardware-resource utilization and computation-core throughput.
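As a rough sketch of this decision (Python; the function name and the uniform re-split it performs are illustrative assumptions, not the patent's exact procedure):

```python
import math

def target_block_count(M, N, K, m_blk, n_blk, k_blk, num_cores):
    """If the initial non-reduction tiling cannot keep every core busy
    in one parallel pass, re-split along K so that the total work equals
    (cores x per-core iterations); return the resulting target-block count."""
    init_blocks = math.ceil(M / m_blk) * math.ceil(N / n_blk)
    k_iters = math.ceil(K / k_blk)           # passes over the reduction dim
    total = init_blocks * k_iters            # total basic-block products
    core_iters = math.ceil(total / num_cores)
    return math.ceil(total / core_iters)     # blocks sized to core_iters each
```

With a 256x256 result in 64x64 blocks (16 initial blocks), K = 512 split into chunks of 128 (4 passes each), and 32 cores, the 64 units of work re-split into 32 target blocks of 2 iterations each, matching the per-core iteration count.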
In some embodiments, the computation submodule includes an iteration module, a first accumulation module, and a second accumulation module. The iteration module is configured to perform iterative computation on each initial block in turn, following the arrangement order of the initial blocks in the general matrix multiplication computation region; the first accumulation module is configured to, when initial blocks are computed in combination, accumulate the iterative computation values of the at least two combined initial blocks; the second accumulation module is configured to, when an initial block is computed in split form, accumulate the iterative computation value of the first split portion with that of the preceding initial block and the iterative computation value of the second split portion with that of the following initial block.
In this general matrix multiplication computation optimization apparatus, the iteration module performs iterative computation on each initial block in turn, obtaining the computation result of each basic block matrix as base data for multiply-accumulation, with indexes marking the start and end positions of each target block; the first accumulation module accumulates the iterative computation values of at least two combined initial blocks, and the second accumulation module accumulates the iterative value of the first split portion of an initial block with that of the preceding initial block and the iterative value of the second split portion with that of the following initial block. This realizes a splitting strategy that uses the computation-core count as the number of splits and performs multiply-accumulate computation based on the target blocks, controls overall memory consumption globally, completes a linear mapping of matrix shapes onto storage by means of hyperparameters, and avoids operator performance degradation caused by fluctuations in cache access cost due to differences in matrix shape.
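The accumulation across block boundaries can be illustrated schematically (Python; the `boundaries` list is a hypothetical encoding of the start/end indexes the text mentions):

```python
def accumulate_partials(partials, boundaries):
    """Sum per-iteration partial results into per-target-block totals.
    boundaries[i] is the index one past the last partial belonging to
    target block i; a K-chunk split off an initial block simply falls
    on the far side of a boundary and is added to the adjacent total."""
    totals, start = [], 0
    for end in boundaries:
        totals.append(sum(partials[start:end]))
        start = end
    return totals
```

In a real kernel the "partials" would be basic-block matrix products rather than scalars, but the boundary-driven accumulation pattern is the same.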
In some embodiments, the first optimization module includes an acquisition sub-module and an optimization sub-module, where the acquisition sub-module is configured to acquire a total iteration number based on a size of the left matrix, a size of the right matrix, and a size of the basic block matrix; the optimization submodule is used for optimizing the number of the preset parallel computing cores based on the total iteration times, the reduction dimension of the universal matrix multiplying operator kernel and the reduction dimension of the basic block matrix, and obtaining an optimal value of the number of the computing cores.
In this apparatus, the acquisition submodule obtains the total iteration count, which serves as the parameter for judging whether the number of parallel computation cores matches the reduction dimension of the operator kernel; the optimization submodule optimizes the preset number of parallel computation cores to obtain the optimal core count, so that the multiply-accumulate process on each computation core matches the reduction dimension of the operator kernel. In other words, the partial sums produced across two iterations and the accumulated K dimension are always integer multiples of the kernel's reduction dimension, so full-load computation of the operator kernel along the reduction dimension is achieved in each iteration, greatly reducing the fragmented computation and memory consumption caused by splitting the K dimension and further improving operator performance.
In some embodiments, the optimization submodule includes a third determining unit, a fourth determining unit and an adjusting unit, where the third determining unit is configured to determine the number of iterations of the computation cores based on the total number of iterations and a preset number of computation cores of the parallel computation; the fourth determining unit is used for determining whether the iteration times of the computing core meet preset constraint conditions, wherein the constraint conditions are determined by the reduction dimension of the general matrix multiplying operator kernel and the reduction dimension of the basic block matrix; the adjusting unit is used for adjusting the number of the computation cores of the parallel computation under the condition that the constraint condition is not met until the iteration times of the corresponding computation cores meet the constraint condition.
In this apparatus, the third determining unit determines the iteration count of the computation cores, providing base data for judging whether the core count is optimal; the fourth determining unit checks whether that iteration count satisfies the preset constraint condition, thereby determining whether the core count is optimal; when the constraint is not satisfied, the adjusting unit adjusts the number of parallel computation cores until the corresponding iteration count satisfies the constraint. This provides a procedure for optimizing the core count: with the optimized count, the partial sums accumulated across initial blocks along the reduction dimension are always integer multiples of the operator kernel's reduction dimension, so full-load computation of the kernel along the reduction dimension is achieved in each iteration, greatly reducing the fragmented computation and memory consumption caused by splitting the reduction dimension and further improving operator performance.
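The adjustment loop might look as follows (Python sketch; the divisibility test stands in for the patent's constraint equation, whose exact form is only partially legible in this translation, and all names are illustrative):

```python
import math

def tune_core_count(total_iters, k_kernel, k_blk, cores, max_cores):
    """Raise the parallel core count until each core's accumulated
    K-dimension run (per-core iterations x K_BLK) is an integer
    multiple of the operator kernel's reduction dimension."""
    candidate = cores
    while candidate <= max_cores:
        per_core = math.ceil(total_iters / candidate)
        if (per_core * k_blk) % k_kernel == 0:
            return candidate
        candidate += 1
    return cores  # no admissible count found; keep the initial choice
```

For instance, with 64 total iterations, a kernel reduction dimension of 256, and K_BLK = 128, a preset of 30 cores fails the check (3 iterations per core), and the loop settles on 32 cores (2 iterations per core, 2 x 128 = 256).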
In some embodiments, the determining module includes a second determining submodule, a third determining submodule and a fourth determining submodule, where the second determining submodule is used for determining the data amount processed by the computing core in each iteration based on the width of the arithmetic logic unit and the number of registers; the third determining submodule is used for determining the non-reduced dimension of the universal matrix multiplying operator kernel based on the number of the computing cores and the data quantity processed by each iteration of the computing cores; the fourth determining submodule is used for determining the reduction dimension of the universal matrix multiplying operator kernel based on the capacity of the data buffer corresponding to the left matrix and the right matrix and the non-reduction dimension.
In this apparatus, the second determination submodule determines the amount of data processed by a computation core in each iteration, yielding the processing-capability parameter of each core; the third determination submodule determines the non-reduction dimensions of the general matrix multiplication operator kernel, and the fourth determination submodule determines its reduction dimension, thereby fixing the kernel's dimensions and obtaining the maximum resource utilization through the multi-core combination.
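A minimal sketch of this sizing, under the assumption that the per-iteration data volume is bounded by ALU width and register count and that the reduction dimension is whatever fits in the left/right operand caches (Python; all names, the scaling rule, and the 4-byte element size are illustrative assumptions, not the patent's formulas):

```python
def kernel_dims(alu_width, num_regs, cache_a, cache_b, num_cores, dtype_bytes=4):
    """Non-reduction dims scale with the number of combined cores;
    the reduction dim is bounded by the left/right operand caches."""
    per_core = min(alu_width, num_regs)   # elements one core handles per iteration
    m_k = per_core * num_cores            # combined-kernel rows (non-reduction)
    n_k = per_core                        # combined-kernel cols (non-reduction)
    k_k = min(cache_a // (m_k * dtype_bytes),
              cache_b // (n_k * dtype_bytes))  # reduction dim that fits in cache
    return m_k, n_k, k_k
```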
In addition, the present embodiment also provides a processor, which includes the general matrix multiplication computation optimization device in the above embodiment, and at least one computation core, where the computation core includes an arithmetic logic unit, a data cache, and a register.
The processor of this embodiment optimizes the general matrix multiplication computation of at least one computation core through the general matrix multiplication computation optimization apparatus, including determining the size of the general matrix multiplication operator kernel, optimizing the number of computation cores for parallel computation, and optimizing the block computation process, thereby improving the throughput of the computation cores, avoiding waste of computing resources, and improving the access efficiency of the data cache and registers.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and are not described in detail in this embodiment.
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments provided herein, fall within the scope of protection of the present application.
It is to be understood that the drawings are merely illustrative of some embodiments of the present application, and that those skilled in the art can adapt the present application to other similar situations without inventive work. In addition, it should be appreciated that while such development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure.
The term "embodiment" in this disclosure means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive. It will be clear or implicitly understood by those of ordinary skill in the art that the embodiments described in the present application can be combined with other embodiments without conflict.
The above embodiments merely represent several implementations of the present application; their description is relatively specific and detailed, but is not to be construed as limiting the scope of the patent claims. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. Accordingly, the scope of protection of this patent shall be determined by the appended claims.

Claims (10)

1. A method of general matrix multiply computation optimization, the method being applied to a processor, the processor comprising at least one computational core, the computational core comprising an arithmetic logic unit, a data cache and registers, the method comprising:
determining the size of a general matrix multiplier kernel based on the width of the arithmetic logic unit, the number of registers, the capacity of the data cache, and a predetermined number of computational cores for constructing the general matrix multiplier kernel;
optimizing the number of computing cores of parallel computing based on the size of the kernel of the universal matrix multiplication operator, the size of a predetermined basic block matrix, and the sizes of a left matrix and a right matrix to be subjected to universal matrix multiplication;
and optimizing the block calculation of the general matrix multiplication calculation area in the data cache based on the number of the calculation cores of the parallel calculation, the size of the basic block matrix and the sizes of the left matrix and the right matrix.
2. The general matrix multiplication optimization method according to claim 1, wherein optimizing the block calculation of the general matrix multiplication calculation region in the data cache based on the number of calculation cores of the parallel calculation, the size of the basic block matrix, and the sizes of the left matrix and the right matrix includes:
Determining the iteration times of the computing cores, the initial block number of the non-reduction dimension of the computing area and the iteration times of the initial blocks based on the sizes of the left matrix, the right matrix and the basic block matrix and the number of the computing cores of the parallel computing;
and under the condition that the iteration times of the computing cores are larger than the iteration times of the initial blocks, carrying out combination computation and split computation of the reduction dimensions on the initial blocks to form target blocks, wherein the iteration times of the target blocks are equal to the iteration times of the computing cores.
3. The method of claim 2, wherein performing a combination calculation and a split calculation of a reduction dimension on the initial block to form a target block comprises:
sequentially performing iterative computation on each initial block based on the arrangement sequence of each initial block in the general matrix multiplication computation area;
when the initial blocks are combined and calculated, accumulating iterative calculation values of at least two combined initial blocks;
and when the initial block is split, accumulating the iterative calculation value of the first split block of the initial block with the iterative calculation value of the former initial block, and accumulating the iterative calculation value of the second split block with the iterative calculation value of the latter initial block.
4. The general matrix multiplication computation optimization method according to claim 2, wherein the determining the number of iterations of the computation cores, the number of initial blocks of the computation region non-reduction dimension, and the number of iterations of the initial blocks based on the sizes of the left matrix, the right matrix, the basic block matrix, and the number of computation cores of the parallel computation includes:
determining the initial block number, the iteration number and the total iteration number of the initial block of the non-reduction dimension of the calculation region based on the sizes of the left matrix, the right matrix and the basic block matrix;
and determining the iteration times of the computation cores based on the number of the computation cores of the parallel computation and the total iteration times.
5. The method according to claim 1, wherein optimizing the number of computation cores of parallel computation based on the size of the kernel of the universal matrix multiplication operator, the size of a predetermined basic block matrix, and the sizes of left and right matrices to be subjected to universal matrix multiplication comprises:
acquiring total iteration times based on the sizes of the left matrix, the right matrix and the basic block matrix;
And optimizing the number of the preset parallel computing cores based on the total iteration times, the reduction dimension of the universal matrix multiplying operator kernel and the reduction dimension of the basic block matrix to obtain an optimal value of the number of the computing cores.
6. The method of optimizing computation of a universal matrix multiplier according to claim 5, wherein optimizing the number of computation cores of the parallel computation set in advance based on the total number of iterations, the reduction dimension of the universal matrix multiplier kernel, and the reduction dimension of the basic block matrix, to obtain an optimal value of the number of computation cores comprises:
determining the iteration times of the computing cores based on the total iteration times and the preset number of the computing cores of parallel computing;
determining whether the iteration count of the computation cores satisfies a preset constraint condition, wherein the constraint condition is determined by the reduction dimension of the general matrix multiplication operator kernel and the reduction dimension of the basic block matrix;
and under the condition that the constraint condition is not met, adjusting the number of the computing cores of the parallel computing until the iteration times of the corresponding computing cores meet the constraint condition.
7. The general matrix multiplication computation optimization method of claim 6, wherein the constraint condition is represented by the following equation:
c·k_k = perkernel·K_BLK
wherein perkernel is the iteration count of the computation cores, K_BLK is the reduction dimension of the basic block matrix, k_k is the reduction dimension of the general matrix multiplication operator kernel, c is a predetermined scaling factor, and n is the number of blocks into which the computation region is divided along the reduction dimension; the value range of n is expressed by the following formula:
8. the method of claim 1, wherein the determining the size of the generic matrix multiplier kernel based on the width of the arithmetic logic unit, the number of registers, the capacity of the data cache, and a predetermined number of computational cores for constructing the generic matrix multiplier kernel comprises:
determining the data quantity processed by the computing core in each iteration based on the width of the arithmetic logic unit and the number of registers;
determining the non-reduced dimension of the universal matrix multiplier kernel based on the number of computing cores and the data volume processed by each iteration of the computing cores;
And determining the reduction dimension of the universal matrix multiplication operator kernel based on the capacity of the data cache corresponding to the left matrix and the right matrix and the non-reduction dimension.
9. A general purpose matrix multiply computing optimization apparatus for use with a processor, the processor comprising at least one computing core, the computing core comprising an arithmetic logic unit, a data cache, and registers, the apparatus comprising:
a determining module, configured to determine a size of a universal matrix multiplier kernel based on a width of the arithmetic logic unit, a number of registers, a capacity of the data cache, and a predetermined number of computation cores configured to form the universal matrix multiplier kernel;
the first optimizing module is used for optimizing the number of calculation cores of parallel calculation based on the size of the kernel of the universal matrix multiplication operator, the size of a predetermined basic block matrix, and the sizes of a left matrix and a right matrix to be subjected to universal matrix multiplication;
and the second optimization module is used for optimizing the block calculation of the general matrix multiplication calculation area in the data cache based on the number of the calculation cores of the parallel calculation, the size of the basic block matrix and the sizes of the left matrix and the right matrix.
10. A processor comprising the generic matrix multiply computing optimization apparatus of claim 9, and at least one computing core comprising arithmetic logic units, data caches, and registers.
CN202311078065.3A 2023-08-25 2023-08-25 General matrix multiplication calculation optimization method, device and processor Active CN116881618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311078065.3A CN116881618B (en) 2023-08-25 2023-08-25 General matrix multiplication calculation optimization method, device and processor

Publications (2)

Publication Number Publication Date
CN116881618A true CN116881618A (en) 2023-10-13
CN116881618B CN116881618B (en) 2024-06-04


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118012628A (en) * 2024-03-15 2024-05-10 北京壁仞科技开发有限公司 Data processing method, device and storage medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5099447A (en) * 1990-01-22 1992-03-24 Alliant Computer Systems Corporation Blocked matrix multiplication for computers with hierarchical memory
CN101485105A (en) * 2006-05-22 2009-07-15 诺基亚公司 Lower complexity computation of lattice reduction
US20170344514A1 (en) * 2016-05-31 2017-11-30 Palo Alto Research Center Incorporated System and method for speeding up general matrix-matrix multiplication on the gpu
CN108509270A (en) * 2018-03-08 2018-09-07 中国科学院软件研究所 The high performance parallel implementation method of K-means algorithms on a kind of domestic 26010 many-core processor of Shen prestige
CN109086244A (en) * 2018-07-11 2018-12-25 中国人民解放军国防科技大学 Matrix convolution vectorization implementation method based on vector processor
CN112069460A (en) * 2020-09-18 2020-12-11 Oppo广东移动通信有限公司 Data processing method and device and electronic equipment
CN113987414A (en) * 2021-11-03 2022-01-28 中国人民解放军国防科技大学 Small and irregular matrix multiplication optimization method based on ARMv8 multi-core processor
CN114090954A (en) * 2021-11-08 2022-02-25 湖南大学 Integer matrix multiplication kernel optimization method based on FT-2000+
CN114707114A (en) * 2022-04-25 2022-07-05 上海壁仞智能科技有限公司 Blocking method and device, convolution operation method and device, and storage medium
CN115130057A (en) * 2022-06-13 2022-09-30 中国电子科技集团公司第十四研究所 FPGA-based universal matrix correlation calculation implementation system and method
CN115408061A (en) * 2022-11-02 2022-11-29 北京红山微电子技术有限公司 Hardware acceleration method, device, chip and storage medium for complex matrix operation
CN115713101A (en) * 2021-08-19 2023-02-24 杭州海康威视数字技术股份有限公司 Parallel processing method and device
CN116306840A (en) * 2021-12-03 2023-06-23 中兴通讯股份有限公司 Neural network operation method, device, chip, electronic equipment and storage medium
CN116401502A (en) * 2023-06-09 2023-07-07 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XIUHONG LI et al.: "A Coordinated Tiling and Batching Framework for Efficient GEMM on GPUs", 2019 Association for Computing Machinery, 31 December 2019, pages 229-241 *
LIU ZHONG et al.: "Vectorization Method of Matrix Multiplication for Multi-core Vector Processors", Chinese Journal of Computers, no. 10, 30 June 2017, pages 79-92 *
WANG CONG et al.: "A Survey of NDN-Based Caching Strategies in the Internet of Vehicles", Internet of Things, 31 December 2019, pages 126-142 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant