CN117828252A - High-performance matrix vector multiplication method based on matrix core - Google Patents

High-performance matrix vector multiplication method based on matrix core

Info

Publication number
CN117828252A
CN117828252A
Authority
CN
China
Prior art keywords
matrix
core
vector multiplication
block
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311438380.2A
Other languages
Chinese (zh)
Inventor
陆璐
钟昊阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Aitesi Information Technology Co ltd
Original Assignee
Shenzhen Aitesi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Aitesi Information Technology Co ltd filed Critical Shenzhen Aitesi Information Technology Co ltd
Priority to CN202311438380.2A priority Critical patent/CN117828252A/en
Publication of CN117828252A publication Critical patent/CN117828252A/en
Pending legal-status Critical Current

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a matrix vector multiplication acceleration method and device based on the matrix core of a deep learning accelerator. The method comprises the following steps: according to the computation scale, tasks are distributed to a certain number of blocks so that the load on the processor cores is balanced; the matrix is converted from its original format in the LDS and read into the storage unit that feeds the matrix core input, which also provides a scheme for avoiding memory-bank conflicts; matrix vector multiplication within a block is completed using the narrowest-granularity mma instruction offered by the architecture's matrix core; prefetching and double buffering keep the most time-consuming operation of the whole task, the HBM accesses, running without interruption while avoiding structural and data hazards in the pipeline; and the calculation result of each block is written into the HBM by hardware atomic accumulation. With the invention, matrix vector multiplication can be performed with high performance by the matrix core of a deep learning processor, effectively accelerating deep learning models in practical application scenarios.

Description

High-performance matrix vector multiplication method based on matrix core
Technical Field
The invention relates to the fields of high-performance computing and deep learning acceleration, and in particular to a high-performance matrix vector multiplication method based on the matrix core used in deep learning.
Background
Computation in deep learning is supported primarily by deep learning accelerators such as GPUs, NPUs and TPUs. Such accelerators all offer similar large-scale data-parallel computing capability and, together with the CPU, form a heterogeneous system. At computation time, data must be carried from host memory into the HBM memory space exclusive to the accelerator. In addition to the L1 and L2 caches, which are not visible to the programmer, the processor usually provides programmable cache units, such as the LDS shared by the threads of a workgroup on a GPU and the registers private to each thread.
Deep learning contains a large number of matrix vector multiplication scenarios. The compute-to-memory ratio of matrix multiplication is far greater than that of its degenerate case, matrix vector multiplication, so the performance of matrix vector multiplication is usually limited by processor bandwidth and the compute units spend a large amount of time idle waiting for memory. At the same time, deep learning has stringent requirements on energy consumption and resource utilization.
Deep learning accelerators in recent years typically include a matrix core to accelerate matrix computation, such as Nvidia's Tensor Core, AMD's Matrix Core, and the Cube unit of the Ascend NPU. These matrix cores are mainly used for half-precision matrix multiplication in deep learning, and their theoretical peak performance can be at least 8 times that of the vector processing units. In deep learning, however, half-precision matrix vector multiplication in particular still relies largely on the vector compute units, because the natural way of reading the operands of a matrix vector multiplication does not intuitively fit the matrix core, and the memory space occupied by the two inputs is large and highly unbalanced. Matrix vector multiplication is a special case of matrix multiplication; how to accelerate it with the high compute power of the matrix core while making full use of the multi-level memory units and compute capability of a deep learning accelerator is the challenge currently faced.
Disclosure of Invention
In order to solve these problems, the invention provides a matrix vector multiplication acceleration method based on the matrix core of a deep learning accelerator, which effectively uses the high compute power of the matrix core and the multi-level storage units of the accelerator and greatly reduces the computation latency of matrix vector multiplication.
Matrix multiplication form: Y = A × B, where A has dimensions M×K, B has dimensions K×N, and Y has dimensions M×N.
Matrix vector multiplication form: Y = V × M or Y = M × V, where M and V denote a matrix and a vector respectively. The multiplication must obey the rules of matrix multiplication: the operand on the left of the multiplication sign takes the role of A above, and the operand on the right takes the role of B. FIG. 1 illustrates the architectural features of a deep learning accelerator; both GPUs and NPUs conform to this computing architecture.
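As a point of reference for the accelerated method described below, the following is a plain C++ scalar implementation of the Y = V × M form (a 1×K row vector times a K×N row-major matrix giving a 1×N row vector); the function name is illustrative and no accelerator is assumed.

#include <vector>

std::vector<float> gemv_reference(const std::vector<float>& v,   // 1 x K
                                  const std::vector<float>& m,   // K x N, row-major
                                  int K, int N)
{
    std::vector<float> y(N, 0.0f);                               // 1 x N result
    for (int k = 0; k < K; ++k)
        for (int n = 0; n < N; ++n)
            y[n] += v[k] * m[k * N + n];                         // accumulate along K
    return y;
}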
Tasks are logically divided into different blocks. The deep learning accelerator contains a number of physical processors; during execution each processor selects an unexecuted block from the scheduling queue and processes it. The processors run in parallel with one another, and inside each processor computation is parallelized in SIMD fashion. FIG. 2 illustrates the matrix core execution model on a processor.
The matrix core corresponds to the mma-related instructions provided by the hardware ISA. An mma instruction reads two fixed-size input matrices from fragmentA and fragmentB, processes them in the matrix core compute unit to obtain the output fragmentC, and accumulates it onto fragmentD; the arrangement of matrix elements inside the fragments is strictly prescribed and differs between computing architectures.
The invention discloses a matrix vector multiplication acceleration method based on a deep learning accelerator matrix core, which comprises the following steps:
Constructing the task grid. The matrix in the matrix vector multiplication is divided into matrix blocks of size TILE_K × TILE_N; each matrix block is read or written exactly once by a block, and the blocks are scheduled for execution on different processors. The deep learning accelerator contains many processors that can compute in parallel; they do not interfere with one another and execute their respective tasks independently, and the only way they can interact is through the off-chip HBM space. Two factors affecting overall computing performance are involved in this step:
Setting the number of atomic accumulations. Different blocks are executed by different processors, and the blocks along the accumulation direction accumulate their results into the same HBM storage area, so the number of blocks in that direction, SPLICT_K, determines the total number of atomic accumulations; the smaller SPLICT_K is, the better.
Setting the number of blocks launched. Launching more blocks generally increases parallelism, but the total should be as close as possible to an integer multiple of the number of processors, so that the amount of computation assigned to each processor is balanced and the tasks of all processors finish at the same time. With SM_NUM processors and a K×N matrix, the optimal number of launched blocks is the one that minimizes the loss L in equation 3.
L=SM_NUM-((SPLICT_K×N_TILE_NUM)%SM_NUM) (3)
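For example (using the MI210 figures from the embodiment below, SM_NUM = 104), a configuration that launches SPLICT_K × N_TILE_NUM = 102 blocks gives L = 104 − (102 % 104) = 2, i.e. only two processors are left without work in the final scheduling wave.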
When the type and configuration of the neural network are fixed, matrix vector multiplication is computed for only a very small number of shapes, so for a specific network the execution performance of the preselected schemes can be enumerated at compile time on the basis of equation 3, yielding the optimal partitioning scheme for the current network.
The computing mode and data distribution of the matrix vector multiplication inside a block are then set according to the matrix core supported by the architecture.
Task division among the workgroups inside a block. Because accumulation is involved, the workgroups in each block must each process an independent result so that communication between different workgroups is avoided. The workgroups are divided along one dimension; their number is determined by the number of matrix cores in a processor, and each workgroup drives one matrix core to complete its computation.
Optionally, when the LDS is large enough, each block reads the whole vector to be processed from the HBM and keeps it resident in the LDS, which reduces the overhead of repeated reads; in this case the number of blocks is set equal to the number of processors, and the index of the matrix block to be processed inside each block is controlled by the program.
Each block creates double buffers for instruction-level parallelism, two storage areas named buffer_1 and buffer_2; buffer_1 and buffer_2 are used alternately to iterate over the matrix vector multiplication of the block and to accumulate on chip.
Data prefetching: each workgroup prefetches the first matrix block it is responsible for into buffer_1 and advances the base address of the matrix block in the HBM to the position of the next block to be processed.
Optionally, the data is rearranged in the LDS to meet the format requirement of the matrix core on its input while avoiding bank conflicts in the LDS, or the arrangement is optimized to satisfy the LDS alignment requirements. This specifically comprises the following steps:
SIMD instructions read the matrix in its original HBM format into the LDS at coarse granularity, so that merged (coalesced) memory transactions are fully exploited.
SIMD instructions then convert the original format into the format required by the fragments. Where several threads would access the same memory bank in the LDS, padding is added at the conflicting positions so that two threads that would otherwise conflict on the same bank access different banks; a generic illustration of this padding idiom is sketched below.
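The following HIP C++ kernel is a generic illustration of the padding idiom, not the patent's exact rearrangement, assuming the usual 32-bank × 4-byte LDS organization of AMD GPUs: a 64×64 half tile is staged through the LDS and read back in a rearranged order, and padding each row by one bank keeps the rearranged reads from piling onto a single bank. All names and the tile shape are illustrative.

#include <hip/hip_runtime.h>

constexpr int DIM = 64;   // tile edge, matching the 64-thread workgroup
constexpr int PAD = 2;    // 2 halves = one extra 4-byte bank of skew per row

__global__ void transpose_tile_via_lds(const _Float16* __restrict__ in,
                                       _Float16* __restrict__ out)
{
    __shared__ _Float16 lds[DIM][DIM + PAD];
    const int t = threadIdx.x;                // one 64-thread workgroup

    // Fill phase: for each row r the 64 lanes write 64 consecutive halves,
    // read coalesced from HBM.
    for (int r = 0; r < DIM; ++r)
        lds[r][t] = in[r * DIM + t];
    __syncthreads();

    // Rearranged read-out: lane t walks row t of the tile. For a fixed c the
    // 64 lanes touch lds[0][c], lds[1][c], ..., lds[63][c]; without PAD the
    // row stride would be 128 bytes = 32 banks, so every lane would hit the
    // same bank and the access would serialize; with PAD they spread out.
    for (int c = 0; c < DIM; ++c)
        out[c * DIM + t] = lds[t][c];
}

With PAD set to 0 the read-out loop would serialize 64 ways on a single bank; the two extra halves per row cost only 256 bytes of LDS for the whole tile.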
Each block iteration completes the processing of the corresponding matrix block, and specifically comprises the following steps:
Inside the loop body, if there are still matrix blocks to be processed, each workgroup fetches the next matrix block it is responsible for into buffer_2 and advances the base address of the matrix block in the HBM to the position of the next block to be processed.
Inside the loop body, the vector is loaded from the HBM into the input storage units of the matrix core; optionally, the portion to be processed is read from the copy preloaded into the LDS. A synchronization checkpoint for the workgroups of the block is added here so that all threads within the block are synchronized.
Inside the loop body, the matrix core is called to complete the matrix vector multiplication of the block, and the result is temporarily held in the accumulator. The buffer indices of frag1 and frag2 are then swapped.
Of these steps, fetching the next matrix block from the HBM is by far the most time consuming, but with double buffering this step runs uninterrupted throughout the computation, the bandwidth is maximized, and the latencies of the vector load and the matrix core computation are completely hidden.
The block's calculation result is then updated into the HBM. This step requires exclusive access to the data at the corresponding HBM address to realize the atomic addition, preferably through an atomic instruction provided by the system.
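A minimal HIP C++ sketch of this write-back (names are illustrative; the patent only specifies that a hardware atomic add is used): each block adds its already-reduced partial slice into the output vector in HBM, so blocks that share the same output slice (SPLICT_K > 1) accumulate correctly without explicit locks.

#include <hip/hip_runtime.h>

__device__ void writeback_partial(float* __restrict__ y_hbm,   // result vector in HBM
                                  const float* partial,        // this block's partial sums
                                  int n_offset, int tile_n)
{
    for (int i = threadIdx.x; i < tile_n; i += blockDim.x)
        atomicAdd(&y_hbm[n_offset + i], partial[i]);            // hardware atomic accumulation
}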
The beneficial effects of the invention are as follows:
1. For modern deep learning accelerators equipped with a matrix compute core, the invention designs a method for performing matrix vector multiplication with the high compute power of the matrix core. Even though some compute capacity is wasted, the matrix compute unit still performs far better than a conventional vector compute unit.
2. Targeting the characteristics of matrix vector multiplication, namely that it is bound by memory bandwidth and that its two inputs are highly unbalanced in size, the invention designs a computation mode different from traditional matrix multiplication: the input matrix, which occupies the larger storage space, is sliced and read continuously by alternating double buffers. By locking the corresponding HBM addresses, the result of each matrix block is accumulated atomically in the HBM.
Drawings
Fig. 1 illustrates: in a deep learning accelerator, tasks are logically divided into multiple shares, each called a block; the blocks execute their subtasks independently without interfering with each other. The deep learning accelerator contains a number of processors, which are physical entities whose number is limited by the hardware specification. The program is executed by the scheduler placing blocks onto processors, as shown by the workflow in the figure. Depending on the configuration in the launch function, each block may be internally divided into several workgroups, which are the basic units for executing SIMD instructions; each workgroup contains a number of threads determined by the specific hardware specification.
Fig. 2 illustrates: the programmable storage units on the processor comprise the registers and the LDS, together with the HBM outside the processor. The compute unit on the processor that the invention concerns is the matrix core. Data is usually loaded from the HBM into the LDS for buffering or some preprocessing, then loaded from the LDS into registers; the matrix core takes elements in specific registers as input, computes the output and accumulates it onto specific registers.
Fig. 3 illustrates: __builtin_amdgcn_mfma_f32_4x4x4f16 is one of the instructions on the AMD CDNA2 architecture that invoke the matrix core for computation. The instruction completes 16 batches of 4x4x4 matrix multiplication in one compute cycle, and its unit of execution is a workgroup. The registers of each thread in the workgroup hold a portion of the input A and B matrices, with the data distribution as shown.
Example 1
The following description of the embodiment is intended to give a thorough understanding of the details of the invention. The practical implementation is, however, not limited to the hardware architecture mentioned in the embodiment; the same ideas can be used to solve similar problems on different hardware architectures.
The deep learning accelerator used in this embodiment is an AMD MI200-series GPU, specifically the MI210, and the goal is half-precision matrix vector multiplication. The relevant specification parameters of the MI210 are as follows:
Peak clock frequency: 1700 MHz
Number of parallel processors: 104
Number of SIMD units per processor: 4
Threads per workgroup: 64
This embodiment targets the token-generation task of a Transformer-structure model in deep learning. Each time a token is generated, the model concatenates its word vector onto the input of the previous inference and uses the result as the input of the new round. Taking the computation of the KV matrices in the attention module as an example: apart from the newly concatenated word vector, the KV entries corresponding to all previous word vectors were already computed in the earlier inference passes, so they are usually cached with the KV-cache technique, and each round only needs to compute the new KV vectors obtained from the dot product of the new word vector with the weights. The KV cache is widely used in this field, and its core computation is matrix vector multiplication.
This embodiment uses the mma instruction __builtin_amdgcn_mfma_f32_4x4x4f16 from the instruction set of the MI210 to perform the core computation. The instruction lets the workgroups on each processor compute 16 batches of 4x4x4 half-precision matrix multiplication simultaneously; when performing matrix vector multiplication, only the part of the input matrix data that corresponds to the input vector is valid. Fig. 3 depicts the placement of the instruction's input and output data in each thread of a workgroup.
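A minimal sketch of the corresponding builtin call in HIP C++, assuming the intrinsic signature exposed for CDNA2 targets such as the MI210 (4-element _Float16 vectors for the A and B fragments, a 4-element float accumulator, and three integer block-control modifiers, here left at 0); the per-lane placement of the fragment elements follows Fig. 3 and is not shown, and the wrapper name is illustrative.

#include <hip/hip_runtime.h>

typedef _Float16 f16x4 __attribute__((ext_vector_type(4)));
typedef float    f32x4 __attribute__((ext_vector_type(4)));

__device__ f32x4 mfma_4x4x4(f16x4 frag_a, f16x4 frag_b, f32x4 frag_c)
{
    // One issue of this instruction performs 16 independent 4x4x4
    // half-precision matrix multiply-accumulates across the 64-lane wavefront.
    return __builtin_amdgcn_mfma_f32_4x4x4f16(frag_a, frag_b, frag_c, 0, 0, 0);
}

The trailing integer arguments are the block-control modifiers (CBSZ, ABID, BLGP); leaving them at 0 selects the default, non-broadcast behaviour.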
S1, in the compilation stage, determine the two dimensions TILE_K and TILE_N of the block size according to the input shape, and record the information to disk so that the configuration can be read directly at run time. The following steps use equation 3 to arrive at the most suitable configuration.
S11, since the workgroup size on the MI210 is 64 threads, TILE_N is set to one of 64, 128, 192, 256. A maximum value MAX_SPLICT_K is set for SPLICT_K; SPLICT_K is traversed from 1 to MAX_SPLICT_K, each SPLICT_K is combined with each of the four TILE_N values, the value of L is evaluated with equation 3, and the results are recorded in an array.
S12, after all combinations have been traversed, a hyper-parameter b is set; S2 and the following steps are carried out for the block configurations corresponding to the b smallest L values, the results are measured, and the block configuration with the highest performance is taken as the final block configuration. A host-side sketch of this enumeration is given below.
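The following plain C++ sketch performs the enumeration of S11/S12 under the MI210 numbers above (SM_NUM = 104). The function name, the struct, and the derivation TILE_K ≈ K / SPLICT_K are illustrative assumptions; the benchmark run over the b surviving candidates is not shown.

#include <algorithm>
#include <vector>

struct BlockConfig { int splict_k; int tile_k; int tile_n; int loss; };

// Score every (SPLICT_K, TILE_N) candidate with equation 3 and keep the b
// lowest-loss configurations; these are then benchmarked and the fastest one
// is written to disk as the final block configuration.
std::vector<BlockConfig> enumerate_configs(int K, int N, int sm_num,
                                           int max_splict_k, int b)
{
    const int tile_n_options[] = {64, 128, 192, 256};    // multiples of the 64-thread workgroup
    std::vector<BlockConfig> candidates;
    for (int sk = 1; sk <= max_splict_k; ++sk) {
        int tile_k = (K + sk - 1) / sk;                   // blocks along K -> tile height
        for (int tn : tile_n_options) {
            int n_tile_num = (N + tn - 1) / tn;           // blocks along N
            int launched   = sk * n_tile_num;             // total blocks launched
            int loss       = sm_num - (launched % sm_num);    // equation 3
            candidates.push_back({sk, tile_k, tn, loss});
        }
    }
    std::sort(candidates.begin(), candidates.end(),
              [](const BlockConfig& a, const BlockConfig& c) { return a.loss < c.loss; });
    if ((int)candidates.size() > b) candidates.resize(b);
    return candidates;
}

For the MI210, enumerate_configs(K, N, 104, MAX_SPLICT_K, b) returns the b candidates that are then passed on to the benchmark of S12.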
S2, according to the instruction set of the architecture of the MI210, __builtin_amdgcn_mfma_f32_4x4x4f16 is chosen as the API for driving the matrix core. Since this embodiment obtains a 1×N vector by right-multiplying a 1×K vector with a K×N matrix, the matrix block is placed in fragmentB.
S3, each block creates double buffers in the LDS and in registers: LDS_BUFFER_1, LDS_BUFFER_2 and FRAGMENT_B1, FRAGMENT_B2. Each buffer in the LDS holds 64×4 half elements, and each buffer in the registers holds 4 half elements. The TILE_N configuration determined in S12 corresponds to 1, 2, 3 or 4 workgroups per block, i.e. to 64, 128, 192 or 256 threads respectively. This embodiment is described using TILE_K = 384 and TILE_N = 64 as an example.
S4, prefetching: each thread of the block reads one half2 element at a time, so in one pass the 64 threads read from the HBM two rows of 32 half2 elements each, separated by a stride of N, and the 64 half2 elements are written into LDS_BUFFER_1 in the LDS. The matrix core consumes an input matrix of size 4×64 per call, so this process is performed twice.
S5, iterate TILE_K/4 times: fetch an input matrix and a vector and call __builtin_amdgcn_mfma_f32_4x4x4f16 to compute and accumulate. Each iteration specifically comprises the following steps:
S51, as in S4, the 64 threads of each workgroup read two rows separated by a stride of N from the HBM in one pass, and the 64 half2 elements are written into LDS_BUFFER_2 in the LDS.
S52, the thread numbered i among the 64 threads reads elements i, i+64, i+128 and i+192 of LDS_BUFFER_1 into its thread-private FRAGMENT_B1.
S53, the threads whose number i is a multiple of 4 fetch the input vector from the HBM into register FRAGMENT_A. A checkpoint is added inside the block so that all workgroups within the block synchronize at this position.
S54, the mma instruction is called to perform one matrix vector multiplication, and the output is accumulated onto the first element of fragmentD in each thread.
S55, the thread numbered i among the 64 threads reads elements i, i+64, i+128 and i+192 of LDS_BUFFER_2 into its thread-private FRAGMENT_B2.
S56, the pointers of LDS_BUFFER_1 and LDS_BUFFER_2 in the LDS are swapped, and the pointers of FRAGMENT_B1 and FRAGMENT_B2 in the registers are swapped.
S6, the result in the final fragmentD is accumulated onto the HBM using the atomic add instruction provided by the hardware. A structural sketch of this loop is given below.
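The following HIP C++ kernel is a structural sketch of the double-buffered loop of S4-S6 for a single 64-thread workgroup, assuming the CDNA2 MFMA builtin discussed above. The per-lane placement of elements into the fragments is deliberately simplified (the exact placement follows Fig. 3), the half2-granularity HBM reads of S4/S51 are written as plain half reads, and the final reduction and write-back are omitted, so this shows the shape of the pipeline rather than a drop-in kernel; the kernel name and the k_offset/n_offset parameters (the block's position in the matrix) are illustrative.

#include <hip/hip_runtime.h>

typedef _Float16 f16x4 __attribute__((ext_vector_type(4)));
typedef float    f32x4 __attribute__((ext_vector_type(4)));

constexpr int TILE_K = 384;   // as in the embodiment
constexpr int TILE_N = 64;    // one 64-thread workgroup per block

__global__ void gemv_tile_sketch(const _Float16* __restrict__ mat,  // K x N, row-major
                                 const _Float16* __restrict__ vec,  // 1 x K
                                 float* __restrict__ y,             // 1 x N result in HBM
                                 int N, int k_offset, int n_offset) // this block's tile origin
{
    __shared__ _Float16 lds_buffer_1[4][TILE_N];
    __shared__ _Float16 lds_buffer_2[4][TILE_N];
    _Float16 (*cur)[TILE_N] = lds_buffer_1;   // slab being consumed
    _Float16 (*nxt)[TILE_N] = lds_buffer_2;   // slab being prefetched

    const int t = threadIdx.x;                // 0..63
    f32x4 acc = {0.f, 0.f, 0.f, 0.f};         // fragmentD accumulator

    // S4: prefetch the first 4 x TILE_N slab of the matrix block into cur.
    for (int r = 0; r < 4; ++r)
        cur[r][t] = mat[(k_offset + r) * N + n_offset + t];
    __syncthreads();

    for (int k = 0; k < TILE_K; k += 4) {
        // S51: start loading the next slab into nxt; this HBM traffic overlaps
        // with the fragment loads and the mfma below.
        if (k + 4 < TILE_K)
            for (int r = 0; r < 4; ++r)
                nxt[r][t] = mat[(k_offset + k + 4 + r) * N + n_offset + t];

        // S52/S53: build the A and B fragments. NOTE: schematic only -- the
        // real per-lane placement follows the Fig. 3 layout.
        f16x4 frag_a, frag_b;
        for (int r = 0; r < 4; ++r) {
            frag_b[r] = cur[r][t];                 // a 4-deep column of the slab
            frag_a[r] = vec[k_offset + k + r];     // the matching vector elements
        }

        // S54: one MFMA issue, accumulating onto acc (fragmentD).
        acc = __builtin_amdgcn_mfma_f32_4x4x4f16(frag_a, frag_b, acc, 0, 0, 0);

        // Wait until every lane has finished reading cur and writing nxt,
        // then ping-pong the buffers (S56).
        __syncthreads();
        _Float16 (*tmp)[TILE_N] = cur; cur = nxt; nxt = tmp;
    }

    // S6: reduce acc across the wavefront into this block's TILE_N partial
    // results and add them into y with hardware atomics (see the write-back
    // sketch earlier); omitted here because it depends on the Fig. 3 layout.
    (void)y;
}

Launching this with one 64-thread workgroup per block and one block per matrix tile reproduces the S4-S6 structure; the real implementation additionally uses half2-wide HBM reads and the exact Fig. 3 fragment layout.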
Through the above process, the matrix compute unit of the MI210 deep learning accelerator is used to realize high-performance matrix vector multiplication; by changing the hyper-parameter settings the method can be implemented on various parallel architectures, greatly accelerating inference in deep learning. Although the invention has been described with reference to the accompanying drawings and to the embodiment in this deep learning application, those skilled in the art will readily understand that the scope of the invention is not limited to the specific embodiment on the specific hardware architecture; the relevant technical features may be replaced for a particular hardware architecture or application scenario without departing from the principles of the invention, and such replaced technical solutions also fall within the scope of the invention.

Claims (13)

1. A high performance matrix vector multiplication method based on a matrix core, comprising the steps of:
S1, constructing a task grid, and determining, by a pre-compilation method, a number of hyper-parameters related to the division of tasks among blocks and inside blocks.
S2, setting the mma instruction to be used according to the hardware architecture, the key being that it matches the computation form of matrix vector multiplication so that as little matrix core compute power as possible is wasted.
S3, initializing double buffers buffer1 and buffer2 on chip, and prefetching the first matrix block to be calculated into buffer1.
S4, fetching the next matrix block to be processed into buffer2.
S5, fetching the vector to be processed into the input unit corresponding to the matrix core, synchronizing all operations, and swapping the pointers to buffer1 and buffer2.
S6, calling the relevant mma instruction, operating on the vector and matrix to be processed with the matrix core, and accumulating the result into on-chip memory.
S7, each block writes its calculation result back atomically to the corresponding HBM storage space, thereby completing the computation of its respective subtask.
2. The method of claim 1, wherein the task grid of step S1 is constructed by dividing the matrix in the matrix vector multiplication into matrix blocks of size TILE_K×TILE_N, and the optimal values of TILE_K and TILE_N are obtained from equation 3 together with the enumeration test of the compilation stage. Further, the values of these hyper-parameters are constrained by two factors: one is the number of times the result vector is accumulated in the K direction, which should be as small as possible; the other is the total number of blocks, which is ideally an integer multiple of the number of physical processors.
3. The high performance matrix vector multiplication method based on a matrix core as claimed in claim 1, characterized in that the pre-compilation for matrix vector multiplication is derived over a given interval according to claim 1 and equation 3, and enumeration tests are carried out on the execution performance of the several preselected schemes, so that the optimal partitioning scheme for the current network is obtained. The enumeration test method is not unique, but the procedure is similar and achieves the same purpose. When the type and configuration of the neural network are fixed, matrix vector multiplication is computed for only a very small number of shapes, so for a specific network the computation mode and data distribution of the matrix vector multiplication inside a block are set according to the matrix core supported by the architecture.
4. The method of claim 1, wherein the hyper-parameters to be determined mainly include the single processing task size in the task grid, and in the method mainly refer to the number of rows and columns of the input matrix sub-blocks.
5. The method of claim 1, wherein the matrix core includes, but is not limited to, the Tensor Core compute unit on Nvidia GPUs and the Cube compute unit on the Ascend NPU.
6. The method of claim 1, wherein the mma instruction is an instruction associated with the matrix core in the deep learning accelerator for computing a most basic matrix multiply-add operation.
7. The high performance matrix vector multiplication method based on matrix cores according to claim 1, wherein the double buffering is applied only to the input matrix: double-buffer units need to be opened in the input storage units accepted by the matrix core, and if the input matrix needs to be rearranged in the LDS, double buffers also need to be opened in the LDS.
8. The double-buffer unit according to claim 7, wherein its size is limited by the hardware memory resources but it can reside in any level of on-chip storage; the main purpose is to let instructions run in parallel, so that the current instruction does not have to wait for the previous instruction to finish executing because of a data hazard.
9. The method of claim 1, wherein the vector to be processed is read from whichever level of cache gives the best bandwidth utilization on the specific hardware architecture; in this scenario the time spent reading the vector should be far less than the time spent reading the matrix to be processed.
10. The matrix core based high performance matrix vector multiplication method according to claim 1, wherein, when each block has fetched the input vector and is ready to begin executing the matrix multiply-add (mma) instructions, all instructions need to be synchronized before the pointers of the two double-buffered storage areas are swapped.
11. The exchange of pointers of the two double-buffered storage areas according to claim 10, wherein the contents stored in the two buffer areas are not changed and only their identities are exchanged, i.e. in the next iteration the base address pointed to by buffer1 is the current base address of buffer2, and the base address pointed to by buffer2 is the current base address of buffer1.
12. The matrix core based high performance matrix vector multiplication method of claim 1, wherein the on-chip memory used by the mma instructions to accumulate results includes the high-speed storage units in the processor, such as registers and the LDS, the main purpose being to keep the time-consuming reduction operations on chip as far as possible and to reduce the number of HBM accesses.
13. The high-performance matrix vector multiplication method based on the matrix core according to claim 1, wherein each block performs the atomic addition of its final calculation result on the HBM through the corresponding atomic instruction provided by the deep learning accelerator; unlike an ordinary addition, the atomic operation guarantees mutually exclusive access of different blocks to the same HBM region and thus the correctness of the computation.
CN202311438380.2A 2023-10-31 2023-10-31 High-performance matrix vector multiplication method based on matrix core Pending CN117828252A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311438380.2A CN117828252A (en) 2023-10-31 2023-10-31 High-performance matrix vector multiplication method based on matrix core

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311438380.2A CN117828252A (en) 2023-10-31 2023-10-31 High-performance matrix vector multiplication method based on matrix core

Publications (1)

Publication Number Publication Date
CN117828252A true CN117828252A (en) 2024-04-05

Family

ID=90516039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311438380.2A Pending CN117828252A (en) 2023-10-31 2023-10-31 High-performance matrix vector multiplication method based on matrix core

Country Status (1)

Country Link
CN (1) CN117828252A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118094074A (en) * 2024-04-28 2024-05-28 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Matrix multiplication calculation result accumulation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6977239B2 (en) Matrix multiplier
Vazquez et al. Improving the performance of the sparse matrix vector product with GPUs
Lu et al. Optimizing depthwise separable convolution operations on gpus
CN108509270B (en) High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor
CN117828252A (en) High-performance matrix vector multiplication method based on matrix core
CN114970294B (en) Three-dimensional strain simulation PCG parallel optimization method and system based on Shenwei architecture
Huang et al. Strassen’s algorithm reloaded on GPUs
Shi et al. Efficient sparse-dense matrix-matrix multiplication on GPUs using the customized sparse storage format
CN113835758A (en) Winograd convolution implementation method based on vector instruction accelerated computation
CN115328439A (en) Incremental matrix multiplication accelerator applied to HPC/AI
Bakunas-Milanowski et al. Efficient algorithms for stream compaction on GPUs
Bandyopadhyay et al. GRS—GPU radix sort for multifield records
WO2022068205A1 (en) Data storage method and system, and data reading method and system
Su et al. Accelerating inclusion-based pointer analysis on heterogeneous CPU-GPU systems
Peng et al. Lock-free parallelization for variance-reduced stochastic gradient descent on streaming data
Jiang et al. Characterizing and optimizing transformer inference on arm many-core processor
Krishnan et al. Multi-stage memory efficient strassen's matrix multiplication on GPU
Li et al. Automatic FFT performance tuning on OpenCL GPUs
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
Yang et al. Characterizing small-scale matrix multiplications on ARMv8-based many-core architectures
Li et al. Autotsmm: An auto-tuning framework for building high-performance tall-and-skinny matrix-matrix multiplication on cpus
Wang et al. GPU acceleration for GRAPES meteorological model
CN114692079A (en) GPU batch matrix multiplication accelerator and processing method thereof
Guo et al. GPU-optimised low-latency online search for gravitational waves from binary coalescences
Rego et al. A fast hybrid approach for stream compaction on GPUs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination