CN117150194A - Batch processing matrix multiplication optimization realization method and system for heterogeneous processor

Info

Publication number: CN117150194A
Application number: CN202311014624.4A
Authority: CN (China)
Prior art keywords: matrix, batch, calculation, sub, dsp
Inventors: 全哲, 张梦豪, 翟宇, 李磊
Current Assignee: Hunan University
Original Assignee: Hunan University
Application filed by Hunan University
Priority date / Filing date: 2023-08-14
Publication date: 2023-12-01
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14: Handling requests for interconnection or transfer
    • G06F 13/20: Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/544: Buffers; Shared memory; Pipes


Abstract

The application discloses a batch processing matrix multiplication optimization realization method and system for heterogeneous processors. The method comprises the following steps: step S1, the CPU allocates space for the matrices on the shared DDR memory through the hthread_malloc function; step S2, the CPU calculates decision parameters through a decision algorithm, the decision parameters comprising matrix block size parameters and an m_batch size parameter; step S3, the DSP function is started based on the decision parameters. The application has the following beneficial effects: an efficient batch matrix multiplication (BGEMM) algorithm is realized on the DSP, which can effectively accelerate applications in many fields, including deep learning; the memory-access latency of the DSP computing units is reduced, improving the computing speed; and the computing efficiency of GEMM is improved.

Description

Batch processing matrix multiplication optimization realization method and system for heterogeneous processor
Technical Field
The application belongs to the technical field of computers, and particularly relates to a batch processing matrix multiplication optimization realization method and system for heterogeneous processors.
Background
Basic Linear Algebra Subprograms (BLAS) is a numerical library interface standard for basic linear algebra operations. Its most central interface is general matrix multiplication (GEMM), which is widely used in deep learning, signal processing, physics, computational chemistry and other fields.
In recent decades, academia and industry have devoted a great deal of effort to optimizing GEMM implementations. As chip technology develops, hardware computing power keeps increasing, so fully utilizing hardware resources and raising the parallelism of problem solving has become a research hot spot in recent years. The current trend is to break a large-scale task into multiple small-scale subtasks that the processor solves in parallel. However, if the amount of computation in each subtask is too small, the hardware computing resources are not fully utilized, the processor cannot reach its full performance, and no acceleration is achieved. Batch processing addresses this by solving multiple subtasks at once, improving resource utilization.
To meet the broad demand for batch processing across fields, experts extended the original BLAS interface with a batch processing interface standard (Batched BLAS); the corresponding batched matrix multiplication (BGEMM) is currently the most widely used of these interfaces.
Taking deep learning as an example: in a convolutional neural network (Convolutional Neural Network, CNN), 90% of the model computation time is spent in the fully connected and convolutional layers, and that computation can be converted into BGEMM. Optimizing the BGEMM implementation is therefore of great importance for accelerating AI training and inference.
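As a concrete illustration of this conversion, the sketch below computes the batched GEMM shapes that an im2col-style lowering of a convolutional layer would produce. The layer sizes and the im2col convention are illustrative assumptions, not taken from the patent text; all example code in this document is written in C.

    /* Illustrative: how a convolutional layer's shapes map onto a
     * batched GEMM under an im2col-style lowering. The layer sizes
     * below are hypothetical. */
    #include <stdio.h>

    int main(void) {
        /* Hypothetical conv layer: 64 output channels, 3 input channels,
         * 3x3 kernel, 224x224 output, batch of 8 images. */
        int c_out = 64, c_in = 3, kh = 3, kw = 3;
        int h_out = 224, w_out = 224, batch = 8;

        /* Each image yields one GEMM C_i = A_i x B_i of the batch:
         * A_i is c_out x (c_in*kh*kw), B_i is (c_in*kh*kw) x (h_out*w_out). */
        int M = c_out, K = c_in * kh * kw, N = h_out * w_out;
        printf("BGEMM: %d independent GEMMs of size M=%d, K=%d, N=%d\n",
               batch, M, K, N);
        return 0;
    }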
Each hardware manufacturer performs specific optimization for its chip architecture and provides a dedicated computing library. Well-known computing libraries implement BGEMM interfaces optimized for a particular instruction set architecture, such as ARMPL (Arm Performance Libraries) for the ARM architecture, oneMKL (Intel oneAPI Math Kernel Library) for the x86 architecture, cuDNN (NVIDIA CUDA Deep Neural Network library) for the NVIDIA GPU architecture, and MAGMA (Matrix Algebra on GPU and Multi-core Architectures).
All of the above are optimized for their specific chip architectures and instruction sets; a chip adopting a new instruction set architecture cannot reuse them directly, and simply porting them would result in poor performance.
Disclosure of Invention
The embodiments of the application aim to provide a batch processing matrix multiplication optimization realization method and system for heterogeneous processors, which realize cooperative work between the host and device ends and implement an efficient batch matrix multiplication algorithm on the device end, thereby solving at least one of the technical problems described in the background.
In order to solve the technical problems, the application provides the following technical scheme:
a batch processing matrix multiplication optimization implementation method for heterogeneous processors comprises the following steps:
step S1, a CPU allocates space to a matrix on a shared DDR memory through an hthread_malloc function;
step S2, a CPU calculates decision parameters through a decision algorithm, wherein the decision parameters comprise matrix block size parameters and m_batch size parameters;
step S3, starting a DSP function based on the decision parameter, wherein the DSP function adopts the following batch processing matrix multiplication:
C_i = α_i · A_i × B_i + β_i · C_i

where i = 1, 2, 3, ..., N is the index of each matrix multiplication; A_i and B_i are the input matrices and C_i is the output matrix; α_i and β_i are scalars representing the operation coefficients.
Optionally, in step S2, the CPU calculates decision parameters by a decision algorithm, including calculation of a matrix block size parameter and calculation of an m_batch size parameter.
Optionally, the calculation of the matrix block size parameters includes:
according to the principles of spatial and temporal locality and the size of the actual DSP storage space, the multiplication of two matrices is refined into a group of panel-panel multiplications and further into a group of block-panel multiplications, so that the data each DSP computing unit needs for a calculation is stored in the scalar storage space and the vector storage space;
the sub-block parameters of the matrices satisfy the following constraints, under which the maximum compute-to-memory-access ratio is solved for to determine the matrix block size parameters:

M_SM * K_CG * 2 * w ≤ size(SM)
(M_CG + K_CG) * N_AM * 2 * w ≤ size(AM)

where w is the width of the data type in bytes, M_SM × K_CG is a sub-block of matrix A, K_CG × N_AM is a sub-block of matrix B, M_SM × N_AM is a sub-block of matrix C, SM is the scalar storage space, and AM is the vector storage space.
Optionally, the calculation of the m_batch size parameter includes:
allocating the matrices to be calculated according to the core numbers, with the parameter satisfying the following constraint:

m_batch * M_CG * K_CG * w ≤ size(GSM)

where GSM is the shared memory within the acceleration cluster.
Optionally, in step S3, starting the DSP function based on the decision parameters includes:
step S31, preparing the calculation data: DMA transfers the data of the sub-blocks of the first batch of m_batch matrices A from DDR to GSM, and carries the data of the sub-blocks of matrices B and C corresponding to each core from DDR to the first buffer of AM;
step S32, further partitioning the sub-blocks of matrix A in GSM: DMA carries the data of a sub-block of matrix A from the on-chip GSM to the first buffer of SM;
step S33, the DSP computing unit reads the data stored in the first buffers of AM and SM and calls the assembly program to compute; at the same time, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the second buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the second buffer of AM;
step S34, after the calculation of step S33 completes, DMA carries the calculation result in the first buffer of AM back to DDR, while the DSP computing unit reads the data stored in the second buffers of AM and SM and computes;
step S35, while the DSP computing unit continues computing on the data stored in the second buffers of AM and SM, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the first buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the first buffer of AM;
step S36, repeating steps S31 to S35 until the GEMM calculations of the m_batch matrices are complete;
step S37, taking m_batch matrices for calculation in each round until all N matrices are calculated.
Optionally, the heterogeneous processor is an MT7032 heterogeneous many-core microprocessor.
The application also provides a batch processing matrix multiplication optimization realization system for the heterogeneous processor, which comprises the following steps:
the space allocation module, used for the CPU to allocate space for the matrices on the shared DDR memory through the hthread_malloc function;
the decision parameter calculation module, used for the CPU to calculate the decision parameters through a decision algorithm;
the core calculation module, used to start the DSP function based on the decision parameters and realize the batch processing matrix multiplication optimization.
The application has the following beneficial effects:
1. the DSP is used for realizing a high-efficiency batch matrix multiplication (BGEMM) algorithm, so that multi-field application including deep learning can be effectively accelerated;
2. by refining the multiplication of two matrices into a group of panel-panel multiplications and a group of block-panel multiplications, the data each DSP computing unit needs for a calculation fits in SM and AM, reducing the memory-access latency of the DSP computing unit and improving the computing speed;
3. the DSP calculation adopts a double-BUFFER mechanism: SM, AM and GSM are each divided into two parts, one part used for calculation while the other receives data, so that computation time covers memory-access time. Two buffers are set up for the sub-blocks of matrix A in the scalar storage space (SM) and two buffers for the sub-blocks of matrices B and C in the vector storage space (AM); DMA double buffering overlaps core calculation with DMA data transfer, hiding memory-access time and improving the computing efficiency of GEMM.
Drawings
For a clearer description of the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the description below are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art, wherein:
FIG. 1 is a diagram of a DSP memory hierarchy provided by an embodiment of the application;
FIG. 2 is a block-based computation graph of a matrix provided by an embodiment of the present application;
FIG. 3 is a block diagram of a batch matrix multiplication optimization implementation system for heterogeneous processors according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type, and are not limited to the number of objects, such as the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
The batch processing matrix multiplication optimization implementation method for the heterogeneous processor provided by the embodiment of the application is described in detail through specific embodiments and application scenes thereof by combining the attached drawings.
The embodiment of the application provides a batch processing matrix multiplication optimization realization method for a heterogeneous processor. It should be noted that the heterogeneous processor is the MT7032 heterogeneous many-core microprocessor, a verification chip for the new-generation Tianhe supercomputer prototype; it is fully independently developed in China and offers strong computing performance. The MT7032 consists of a 16-core ARMv8 CPU and 4 mutually independent DSP acceleration clusters. Each DSP cluster consists of 8 DSP cores sharing 6 MB of on-chip global shared memory (Global Shared Memory, GSM) and 32 GB of off-chip DDR space. A DSP core mainly comprises an instruction fetch unit (Instruction Fetch Unit, IFU); a scalar processing unit (Scalar Processing Unit, SPU) with a 64 KB scalar memory (SM); a vector processing unit (Vector Processing Unit, VPU) with a 768 KB vector memory (AM), each VPU containing 16 vector processing elements (Vector Processing Element, VPE), each of which can execute 3 floating-point multiply-add instructions simultaneously; and a direct memory access unit (Direct Memory Access, DMA).
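For reference in the sketches below, the per-cluster storage parameters stated above can be collected as C constants; the values are those given in the preceding paragraph, while the identifier names are ours.

    /* Memory-hierarchy parameters of one MT7032 DSP cluster, as stated
     * in the text above. The identifier names are our own. */
    enum {
        CORES_PER_CLUSTER = 8,
        VPES_PER_VPU      = 16,
        SIZE_SM  = 64 * 1024,         /* per-core scalar memory (SM) */
        SIZE_AM  = 768 * 1024,        /* per-core vector memory (AM) */
        SIZE_GSM = 6 * 1024 * 1024    /* cluster-shared on-chip GSM  */
    };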
The method comprises the following steps:
step S1, a CPU allocates space to a matrix on a shared DDR memory through an hthread_malloc function;
step S2, a CPU calculates decision parameters through a decision algorithm, wherein the decision parameters comprise matrix block size parameters and m_batch size parameters;
step S3, starting a DSP function based on the decision parameter, wherein the DSP function adopts the following batch processing matrix multiplication:
C_i = α_i · A_i × B_i + β_i · C_i

where i = 1, 2, 3, ..., N is the index of each matrix multiplication; A_i and B_i are the input matrices and C_i is the output matrix; α_i and β_i are scalars representing the operation coefficients.
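A minimal host-side sketch of steps S1 to S3 follows. Only hthread_malloc is named in the text; decide_params() and launch_bgemm_kernel() are hypothetical placeholders for the decision algorithm and the DSP kernel launch, whose actual interfaces are not given here, and the matrices are assumed already allocated on the shared DDR by the caller.

    /* Host-side flow of steps S1-S3 (sketch; helper names are assumed). */
    typedef struct {
        int m_sm, k_cg, n_am, m_cg;   /* matrix block size parameters */
        int m_batch;                  /* matrices computed per round  */
    } decision_t;

    /* Hypothetical helpers standing in for APIs not detailed in the text. */
    extern decision_t decide_params(int M, int N, int K, int batch);
    extern void launch_bgemm_kernel(float **A, float **B, float **C,
                                    int batch, decision_t d);

    void bgemm_host(int M, int N, int K, int batch,
                    float **A, float **B, float **C) {
        /* S1: A, B and C are assumed already allocated on the shared
         * DDR memory via hthread_malloc by the caller. */
        /* S2: the CPU computes the decision parameters. */
        decision_t d = decide_params(M, N, K, batch);
        /* S3: start the DSP function with those parameters. */
        launch_bgemm_kernel(A, B, C, batch, d);
    }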
In step S2, the CPU calculates decision parameters by a decision algorithm, including calculation of matrix block size parameters and calculation of m_batch size parameters.
Wherein, the calculation of the matrix block size parameter comprises:
the memory resources required by the DSP during BGEMM calculation may be divided into four layers of pyramid structures from top to bottom, as shown in fig. 1. The application reasonably blocks the matrix, refines the multiplication operation of two matrixes into the multiplication operation of a group of panel-panel and the multiplication operation of a group of block-panel according to the size of the actual storage space of the DSP by utilizing the principles of space locality and time locality, and particularly combines with the illustration of figure 2 to store the data required to be calculated by each DSP calculation unit in scalar storage space and AM, thereby reducing the access delay of the DSP calculation unit and improving the calculation speed.
Considering the capacity limits of the vector memory, the scalar memory and the on-chip global shared cache, for a data type of w bytes the sub-block of matrix A (M_SM × K_CG), the sub-block of matrix B (K_CG × N_AM) and the sub-block of matrix C (M_SM × N_AM) must satisfy the following constraints:

M_SM * K_CG * 2 * w ≤ size(SM)
(M_CG + K_CG) * N_AM * 2 * w ≤ size(AM)
m_batch * M_CG * K_CG * w ≤ size(GSM)

Under these constraints, the maximum compute-to-memory-access ratio is solved for, determining the decision parameters; a brute-force sketch of this search is given below.
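The text does not spell out the search procedure, so the following is a brute-force sketch under stated assumptions: the candidate grid, the treatment of M_CG as equal to M_SM, and the flops-per-byte objective are ours; only the two SM/AM double-buffer constraints come from the formulas above.

    /* Brute-force block-size search (sketch). Enumerates (M_SM, K_CG,
     * N_AM) triples that fit the double-buffered SM/AM constraints and
     * keeps the one with the highest compute-to-memory-access ratio. */
    #include <stdio.h>

    #define SIZE_SM (64 * 1024)     /* repeated so the sketch stands alone */
    #define SIZE_AM (768 * 1024)

    int main(void) {
        const int w = 4;            /* bytes per element (single precision) */
        int best_m = 0, best_k = 0, best_n = 0;
        double best = 0.0;

        for (int m = 8; m <= 512; m += 8)
            for (int k = 8; k <= 512; k += 8)
                for (int n = 8; n <= 2048; n += 8) {
                    if ((long)m * k * 2 * w > SIZE_SM) continue;       /* A in SM    */
                    if ((long)(m + k) * n * 2 * w > SIZE_AM) continue; /* B, C in AM */
                    /* flops / bytes moved per block-panel step (assumed model) */
                    double ratio = (2.0 * m * n * k)
                                 / ((double)(m * k + k * n + 2.0 * m * n) * w);
                    if (ratio > best) { best = ratio; best_m = m; best_k = k; best_n = n; }
                }
        printf("M_SM=%d K_CG=%d N_AM=%d, %.2f flops/byte\n",
               best_m, best_k, best_n, best);
        return 0;
    }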
For the m_batch matrices calculated in one round, there are 8 cores in a DSP cluster. Assuming the m_batch computed above is 2, cores 0 to 3 calculate the 1st matrix and cores 4 to 7 calculate the 2nd matrix. Matrices are assigned by core number, so each core performs a single GEMM calculation and the cores together form the BGEMM calculation; this mapping is sketched below.
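A one-line rendering of that assignment (0-based indices; it assumes m_batch divides the core count evenly, as in the example):

    /* Which matrix of the current round a given core works on. With
     * 8 cores and m_batch = 2, cores 0-3 map to matrix 0 and cores
     * 4-7 to matrix 1, matching the example above. */
    int matrix_for_core(int core_id, int m_batch) {
        int cores_per_matrix = 8 / m_batch;   /* assumes m_batch divides 8 */
        return core_id / cores_per_matrix;
    }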
In step S3, starting the DSP function based on the decision parameters includes:
step S31, preparing the calculation data: DMA transfers the data of the sub-blocks of the first batch of m_batch matrices A from DDR to GSM, and carries the data of the sub-blocks of matrices B and C corresponding to each core from DDR to the first buffer of AM;
step S32, further partitioning the sub-blocks of matrix A in GSM: DMA carries the data of a sub-block of matrix A from the on-chip GSM to the first buffer of SM;
step S33, the DSP computing unit reads the data stored in the first buffers of AM and SM and calls the assembly program to compute; at the same time, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the second buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the second buffer of AM;
step S34, after the calculation of step S33 completes, DMA carries the calculation result in the first buffer of AM back to DDR, while the DSP computing unit reads the data stored in the second buffers of AM and SM and computes;
step S35, while the DSP computing unit continues computing on the data stored in the second buffers of AM and SM, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the first buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the first buffer of AM;
step S36, repeating steps S31 to S35 until the GEMM calculations of the m_batch matrices are complete;
step S37, taking m_batch matrices for calculation in each round until all N matrices are calculated.
It should be noted that the DSP function adopts a double-BUFFER mechanism during calculation: SM, AM and GSM are each divided into two parts, one part used for calculation while the other receives data, so that computation time covers memory-access time. Two buffers are set up for the sub-blocks of matrix A in the scalar storage space (SM) and two buffers for the sub-blocks of matrices B and C in the vector storage space (AM); DMA double buffering overlaps core calculation with DMA data transfer, hiding memory-access time and improving the computing efficiency of GEMM. A skeleton of this ping-pong pipeline is sketched below.
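The skeleton below shows one core's ping-pong loop following steps S31 to S37. The dma_*() and compute_block() functions are no-op stubs standing in for the real DMA programming and the assembly micro-kernel, whose interfaces the text does not give.

    /* Double-buffer (ping-pong) pipeline for one core, per steps S31-S37.
     * The stubs below only mark where the real operations would go. */
    static void dma_gsm_to_sm(int buf) { (void)buf; } /* A sub-block: GSM -> SM     */
    static void dma_ddr_to_am(int buf) { (void)buf; } /* B, C sub-blocks: DDR -> AM */
    static void dma_am_to_ddr(int buf) { (void)buf; } /* C result: AM -> DDR        */
    static void dma_wait(void)         {}             /* wait for DMA completion    */
    static void compute_block(int buf) { (void)buf; } /* assembly micro-kernel      */

    void core_pipeline(int n_blocks) {
        int cur = 0;                      /* ping-pong buffer index: 0 or 1   */
        dma_gsm_to_sm(cur);               /* S31/S32: prime the first buffers */
        dma_ddr_to_am(cur);
        dma_wait();
        for (int b = 0; b < n_blocks; b++) {
            int nxt = cur ^ 1;
            if (b + 1 < n_blocks) {       /* S33/S35: prefetch the next block */
                dma_gsm_to_sm(nxt);
                dma_ddr_to_am(nxt);
            }
            compute_block(cur);           /* compute overlaps the prefetch DMA */
            dma_am_to_ddr(cur);           /* S34: write this block's result back */
            dma_wait();                   /* conservative: real code would also
                                           * overlap the write-back with the
                                           * next computation */
            cur = nxt;
        }
    }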
The application also provides a batch processing matrix multiplication optimization realization system for heterogeneous processors, used to perform the above method. As shown in FIG. 3, the system comprises a space allocation module 1, a decision parameter calculation module 2 and a core calculation module 3.
The space allocation module 1 is used for the CPU to allocate space for the matrices on the shared DDR memory through the hthread_malloc function;
the decision parameter calculation module 2 is used for the CPU to calculate the decision parameters through a decision algorithm;
the core calculation module 3 is used to start the DSP function based on the decision parameters and realize the batch processing matrix multiplication optimization.
The application has the following beneficial effects:
1. the DSP is used for realizing a high-efficiency batch matrix multiplication (BGEMM) algorithm, so that multi-field application including deep learning can be effectively accelerated;
2. by refining the multiplication of two matrices into a group of panel-panel multiplications and further into a group of block-panel multiplications, the data each DSP computing unit needs for a calculation fits in SM and AM, reducing the memory-access latency of the DSP computing unit and improving the computing speed;
3. the DSP calculation adopts a double-BUFFER mechanism: SM, AM and GSM are each divided into two parts, one part used for calculation while the other receives data, so that computation time covers memory-access time. Two buffers are set up for the sub-blocks of matrix A in the scalar storage space (SM) and two buffers for the sub-blocks of matrices B and C in the vector storage space (AM); DMA double buffering overlaps core calculation with DMA data transfer, hiding memory-access time and improving the computing efficiency of GEMM.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (7)

1. The batch processing matrix multiplication optimization implementation method for the heterogeneous processor is characterized by comprising the following steps of:
step S1, a CPU allocates space to a matrix on a shared DDR memory through an hthread_malloc function;
step S2, a CPU calculates decision parameters through a decision algorithm, wherein the decision parameters comprise matrix block size parameters and m_batch size parameters;
step S3, starting a DSP function based on the decision parameter, wherein the DSP function adopts the following batch processing matrix multiplication:
C_i = α_i · A_i × B_i + β_i · C_i

where i = 1, 2, 3, ..., N is the index of each matrix multiplication; A_i and B_i are the input matrices and C_i is the output matrix; α_i and β_i are scalars representing the operation coefficients.
2. The method according to claim 1, wherein in step S2, the CPU calculates decision parameters by a decision algorithm, including a matrix block size parameter calculation and an m_batch size parameter calculation.
3. The method of claim 2, wherein the calculation of the matrix block size parameters comprises:
according to the principles of spatial and temporal locality and the size of the actual DSP storage space, the multiplication of two matrices is refined into a group of panel-panel multiplications and further into a group of block-panel multiplications, so that the data each DSP computing unit needs for a calculation is stored in the scalar storage space and the vector storage space;
the sub-block parameters of the matrices satisfy the following constraints, under which the maximum compute-to-memory-access ratio is solved for to determine the matrix block size parameters:

M_SM * K_CG * 2 * w ≤ size(SM)
(M_CG + K_CG) * N_AM * 2 * w ≤ size(AM)

where w is the width of the data type in bytes, M_SM × K_CG is a sub-block of matrix A, K_CG × N_AM is a sub-block of matrix B, M_SM × N_AM is a sub-block of matrix C, SM is the scalar storage space, and AM is the vector storage space.
4. The method of claim 3, wherein the m_batch size parameter calculation comprises:
allocating the matrices to be calculated according to the core numbers, with the parameter satisfying the following constraint:

m_batch * M_CG * K_CG * w ≤ size(GSM)

where GSM is the shared memory within the acceleration cluster.
5. The method according to claim 4, wherein in step S3, starting the DSP function based on the decision parameters comprises:
step S31, preparing the calculation data: DMA transfers the data of the sub-blocks of the first batch of m_batch matrices A from DDR to GSM, and carries the data of the sub-blocks of matrices B and C corresponding to each core from DDR to the first buffer of AM;
step S32, further partitioning the sub-blocks of matrix A in GSM: DMA carries the data of a sub-block of matrix A from the on-chip GSM to the first buffer of SM;
step S33, the DSP computing unit reads the data stored in the first buffers of AM and SM and calls the assembly program to compute; at the same time, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the second buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the second buffer of AM;
step S34, after the calculation of step S33 completes, DMA carries the calculation result in the first buffer of AM back to DDR, while the DSP computing unit reads the data stored in the second buffers of AM and SM and computes;
step S35, while the DSP computing unit continues computing on the data stored in the second buffers of AM and SM, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the first buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the first buffer of AM;
step S36, repeating steps S31 to S35 until the GEMM calculations of the m_batch matrices are complete;
step S37, taking m_batch matrices for calculation in each round until all N matrices are calculated.
6. The method of claim 1, wherein the heterogeneous processor is an MT7032 heterogeneous many-core microprocessor.
7. A batch processing matrix multiplication optimization realization system for heterogeneous processors, configured to execute the method of any one of claims 1-6, the system comprising:
the space allocation module, used for the CPU to allocate space for the matrices on the shared DDR memory through the hthread_malloc function;
the decision parameter calculation module, used for the CPU to calculate the decision parameters through a decision algorithm;
the core calculation module, used to start the DSP function based on the decision parameters and realize the batch processing matrix multiplication optimization.
Application CN202311014624.4A, filed 2023-08-14 (priority date 2023-08-14): Batch processing matrix multiplication optimization realization method and system for heterogeneous processor. Status: Pending. Publication: CN117150194A.

Priority Applications (1)

Application Number: CN202311014624.4A; Priority Date: 2023-08-14; Filing Date: 2023-08-14; Title: Batch processing matrix multiplication optimization realization method and system for heterogeneous processor
Publications (1)

Publication Number: CN117150194A; Publication Date: 2023-12-01

Family

ID=88899742



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination