CN117150194A - Batch processing matrix multiplication optimization realization method and system for heterogeneous processor

Info

Publication number: CN117150194A
Application number: CN202311014624.4A
Authority: CN (China)
Prior art keywords: matrix, batch, calculation, sub, dsp
Inventors: 全哲, 张梦豪, 翟宇, 李磊
Current Assignee: Hunan University
Original Assignee: Hunan University
Application filed by Hunan University
Priority date / Filing date: 2023-08-14
Publication date: 2023-12-01
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14: Handling requests for interconnection or transfer
    • G06F 13/20: Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/544: Buffers; Shared memory; Pipes


Abstract

The application discloses a batch processing matrix multiplication optimization realization method and system for heterogeneous processors. The method comprises the following steps: step S1, the CPU allocates space for the matrices on the shared DDR memory through the hthread_malloc function; step S2, the CPU calculates decision parameters through a decision algorithm, the decision parameters comprising matrix block size parameters and an m_batch size parameter; step S3, the DSP function is started based on the decision parameters. The application has the following beneficial effects: an efficient batch matrix multiplication (BGEMM) algorithm is realized on the DSP, which can effectively accelerate applications in many fields, including deep learning; the memory-access latency of the DSP computing units is reduced, improving the computing speed; and the computing efficiency of GEMM is improved.

Description

Batch processing matrix multiplication optimization realization method and system for heterogeneous processor
Technical Field
The application belongs to the technical field of computers, and particularly relates to a batch processing matrix multiplication optimization realization method and system for heterogeneous processors.
Background
Basic Linear Algebra Subprograms (BLAS) is a numerical library interface standard for basic linear algebra operations. Its most central interface is general matrix multiplication (GEMM), which is widely used in deep learning, signal processing, physics, computational chemistry and other fields.
In recent decades, academia and industry have devoted a great deal of effort to optimizing GEMM implementations. As chip technology develops, hardware computing power keeps increasing, so fully utilizing hardware resources and raising the parallelism of problem solving has become a research hot spot in recent years. The current trend is to break a large-scale task into multiple small-scale subtasks that the processor solves in parallel. However, if the amount of computation in each subtask is too small, the hardware computing resources are not fully utilized, the processor cannot reach its full performance, and no acceleration is achieved. Batch processing addresses this by solving multiple subtasks at once, improving resource utilization.
To meet the broad demand for batch processing across fields, experts extended the original BLAS interface with a batch processing interface standard (Batched BLAS); the corresponding batched matrix multiplication (BGEMM) is currently the most widely used of these interfaces.
Taking deep learning as an example: in a convolutional neural network (Convolutional Neural Network, CNN), 90% of the model computation time is spent in the fully connected and convolutional layers, and that computation can be converted into BGEMM. Optimizing the BGEMM implementation is therefore of great importance for accelerating AI training and inference.
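As a concrete illustration of this conversion, the sketch below computes the batched GEMM shapes that an im2col-style lowering of a convolutional layer would produce. The layer sizes and the im2col convention are illustrative assumptions, not taken from the patent text; all example code in this document is written in C.

    /* Illustrative: how a convolutional layer's shapes map onto a
     * batched GEMM under an im2col-style lowering. The layer sizes
     * below are hypothetical. */
    #include <stdio.h>

    int main(void) {
        /* Hypothetical conv layer: 64 output channels, 3 input channels,
         * 3x3 kernel, 224x224 output, batch of 8 images. */
        int c_out = 64, c_in = 3, kh = 3, kw = 3;
        int h_out = 224, w_out = 224, batch = 8;

        /* Each image yields one GEMM C_i = A_i x B_i of the batch:
         * A_i is c_out x (c_in*kh*kw), B_i is (c_in*kh*kw) x (h_out*w_out). */
        int M = c_out, K = c_in * kh * kw, N = h_out * w_out;
        printf("BGEMM: %d independent GEMMs of size M=%d, K=%d, N=%d\n",
               batch, M, K, N);
        return 0;
    }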
Each hardware manufacturer performs specific optimization for its chip architecture and provides a dedicated computing library. Well-known computing libraries implement BGEMM interfaces optimized for a particular instruction set architecture, such as ARMPL (Arm Performance Libraries) for the ARM architecture, oneMKL (Intel oneAPI Math Kernel Library) for the x86 architecture, cuDNN (NVIDIA CUDA Deep Neural Network library) for the NVIDIA GPU architecture, and MAGMA (Matrix Algebra on GPU and Multi-core Architectures).
All of the above are optimized for their specific chip architectures and instruction sets; a chip adopting a new instruction set architecture cannot reuse them directly, and simply porting them would result in poor performance.
Disclosure of Invention
The embodiments of the application aim to provide a batch processing matrix multiplication optimization realization method and system for heterogeneous processors, which realize cooperative work between the host and device ends and implement an efficient batch matrix multiplication algorithm on the device end, thereby solving at least one of the technical problems described in the background.
In order to solve the technical problems, the application provides the following technical scheme:
a batch processing matrix multiplication optimization implementation method for heterogeneous processors comprises the following steps:
step S1, a CPU allocates space to a matrix on a shared DDR memory through an hthread_malloc function;
step S2, a CPU calculates decision parameters through a decision algorithm, wherein the decision parameters comprise matrix block size parameters and m_batch size parameters;
step S3, starting a DSP function based on the decision parameter, wherein the DSP function adopts the following batch processing matrix multiplication:
C_i = α_i · A_i × B_i + β_i · C_i

where i = 1, 2, 3, ..., N is the index of each matrix multiplication; A_i and B_i are the input matrices and C_i is the output matrix; α_i and β_i are scalars representing the operation coefficients.
Optionally, in step S2, the CPU calculates decision parameters by a decision algorithm, including calculation of a matrix block size parameter and calculation of an m_batch size parameter.
Optionally, the calculation of the matrix block size parameters includes:
according to the principles of spatial and temporal locality and the size of the actual DSP storage space, the multiplication of two matrices is refined into a group of panel-panel multiplications and further into a group of block-panel multiplications, so that the data each DSP computing unit needs for a calculation is stored in the scalar storage space and the vector storage space;
the sub-block parameters of the matrices satisfy the following constraints, under which the maximum compute-to-memory-access ratio is solved for to determine the matrix block size parameters:

M_SM * K_CG * 2 * w ≤ size(SM)
(M_CG + K_CG) * N_AM * 2 * w ≤ size(AM)

where w is the width of the data type in bytes, M_SM × K_CG is a sub-block of matrix A, K_CG × N_AM is a sub-block of matrix B, M_SM × N_AM is a sub-block of matrix C, SM is the scalar storage space, and AM is the vector storage space.
Optionally, the calculation of the m_batch size parameter includes:
allocating the matrices to be calculated according to the core numbers, with the parameter satisfying the following constraint:

m_batch * M_CG * K_CG * w ≤ size(GSM)

where GSM is the shared memory within the acceleration cluster.
Optionally, in step S3, starting the DSP function based on the decision parameters includes:
step S31, preparing the calculation data: DMA transfers the data of the sub-blocks of the first batch of m_batch matrices A from DDR to GSM, and carries the data of the sub-blocks of matrices B and C corresponding to each core from DDR to the first buffer of AM;
step S32, further partitioning the sub-blocks of matrix A in GSM: DMA carries the data of a sub-block of matrix A from the on-chip GSM to the first buffer of SM;
step S33, the DSP computing unit reads the data stored in the first buffers of AM and SM and calls the assembly program to compute; at the same time, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the second buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the second buffer of AM;
step S34, after the calculation of step S33 completes, DMA carries the calculation result in the first buffer of AM back to DDR, while the DSP computing unit reads the data stored in the second buffers of AM and SM and computes;
step S35, while the DSP computing unit continues computing on the data stored in the second buffers of AM and SM, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the first buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the first buffer of AM;
step S36, repeating steps S31 to S35 until the GEMM calculations of the m_batch matrices are complete;
step S37, taking m_batch matrices for calculation in each round until all N matrices are calculated.
Optionally, the heterogeneous processor is an MT7032 heterogeneous many-core microprocessor.
The application also provides a batch processing matrix multiplication optimization realization system for the heterogeneous processor, which comprises the following steps:
the space allocation module, used for the CPU to allocate space for the matrices on the shared DDR memory through the hthread_malloc function;
the decision parameter calculation module, used for the CPU to calculate the decision parameters through a decision algorithm;
the core calculation module, used to start the DSP function based on the decision parameters and realize the batch processing matrix multiplication optimization.
The application has the following beneficial effects:
1. the DSP is used for realizing a high-efficiency batch matrix multiplication (BGEMM) algorithm, so that multi-field application including deep learning can be effectively accelerated;
2. by refining the multiplication of two matrices into a group of panel-panel multiplications and a group of block-panel multiplications, the data each DSP computing unit needs for a calculation fits in SM and AM, reducing the memory-access latency of the DSP computing unit and improving the computing speed;
3. the DSP calculation adopts a double-BUFFER mechanism: SM, AM and GSM are each divided into two parts, one part used for calculation while the other receives data, so that computation time covers memory-access time. Two buffers are set up for the sub-blocks of matrix A in the scalar storage space (SM) and two buffers for the sub-blocks of matrices B and C in the vector storage space (AM); DMA double buffering overlaps core calculation with DMA data transfer, hiding memory-access time and improving the computing efficiency of GEMM.
Drawings
For a clearer description of the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the description below are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art, wherein:
FIG. 1 is a diagram of a DSP memory hierarchy provided by an embodiment of the application;
FIG. 2 is a block-based computation graph of a matrix provided by an embodiment of the present application;
FIG. 3 is a block diagram of a batch matrix multiplication optimization implementation system for heterogeneous processors according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type, and are not limited to the number of objects, such as the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
The batch processing matrix multiplication optimization implementation method for the heterogeneous processor provided by the embodiment of the application is described in detail through specific embodiments and application scenes thereof by combining the attached drawings.
The embodiment of the application provides a batch processing matrix multiplication optimization realization method for a heterogeneous processor. It should be noted that the heterogeneous processor is the MT7032 heterogeneous many-core microprocessor, a verification chip for the new-generation Tianhe supercomputer prototype; it is fully independently developed in China and offers strong computing performance. The MT7032 consists of a 16-core ARMv8 CPU and 4 mutually independent DSP acceleration clusters. Each DSP cluster consists of 8 DSP cores sharing 6 MB of on-chip global shared memory (Global Shared Memory, GSM) and 32 GB of off-chip DDR space. A DSP core mainly comprises an instruction fetch unit (Instruction Fetch Unit, IFU); a scalar processing unit (Scalar Processing Unit, SPU) with a 64 KB scalar memory (SM); a vector processing unit (Vector Processing Unit, VPU) with a 768 KB vector memory (AM), each VPU containing 16 vector processing elements (Vector Processing Element, VPE), each of which can execute 3 floating-point multiply-add instructions simultaneously; and a direct memory access unit (Direct Memory Access, DMA).
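For reference in the sketches below, the per-cluster storage parameters stated above can be collected as C constants; the values are those given in the preceding paragraph, while the identifier names are ours.

    /* Memory-hierarchy parameters of one MT7032 DSP cluster, as stated
     * in the text above. The identifier names are our own. */
    enum {
        CORES_PER_CLUSTER = 8,
        VPES_PER_VPU      = 16,
        SIZE_SM  = 64 * 1024,         /* per-core scalar memory (SM) */
        SIZE_AM  = 768 * 1024,        /* per-core vector memory (AM) */
        SIZE_GSM = 6 * 1024 * 1024    /* cluster-shared on-chip GSM  */
    };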
The method comprises the following steps:
step S1, a CPU allocates space to a matrix on a shared DDR memory through an hthread_malloc function;
step S2, a CPU calculates decision parameters through a decision algorithm, wherein the decision parameters comprise matrix block size parameters and m_batch size parameters;
step S3, starting a DSP function based on the decision parameter, wherein the DSP function adopts the following batch processing matrix multiplication:
C_i = α_i · A_i × B_i + β_i · C_i

where i = 1, 2, 3, ..., N is the index of each matrix multiplication; A_i and B_i are the input matrices and C_i is the output matrix; α_i and β_i are scalars representing the operation coefficients.
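A minimal host-side sketch of steps S1 to S3 follows. Only hthread_malloc is named in the text; decide_params() and launch_bgemm_kernel() are hypothetical placeholders for the decision algorithm and the DSP kernel launch, whose actual interfaces are not given here, and the matrices are assumed already allocated on the shared DDR by the caller.

    /* Host-side flow of steps S1-S3 (sketch; helper names are assumed). */
    typedef struct {
        int m_sm, k_cg, n_am, m_cg;   /* matrix block size parameters */
        int m_batch;                  /* matrices computed per round  */
    } decision_t;

    /* Hypothetical helpers standing in for APIs not detailed in the text. */
    extern decision_t decide_params(int M, int N, int K, int batch);
    extern void launch_bgemm_kernel(float **A, float **B, float **C,
                                    int batch, decision_t d);

    void bgemm_host(int M, int N, int K, int batch,
                    float **A, float **B, float **C) {
        /* S1: A, B and C are assumed already allocated on the shared
         * DDR memory via hthread_malloc by the caller. */
        /* S2: the CPU computes the decision parameters. */
        decision_t d = decide_params(M, N, K, batch);
        /* S3: start the DSP function with those parameters. */
        launch_bgemm_kernel(A, B, C, batch, d);
    }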
In step S2, the CPU calculates decision parameters by a decision algorithm, including calculation of matrix block size parameters and calculation of m_batch size parameters.
Wherein, the calculation of the matrix block size parameter comprises:
the memory resources required by the DSP during BGEMM calculation may be divided into four layers of pyramid structures from top to bottom, as shown in fig. 1. The application reasonably blocks the matrix, refines the multiplication operation of two matrixes into the multiplication operation of a group of panel-panel and the multiplication operation of a group of block-panel according to the size of the actual storage space of the DSP by utilizing the principles of space locality and time locality, and particularly combines with the illustration of figure 2 to store the data required to be calculated by each DSP calculation unit in scalar storage space and AM, thereby reducing the access delay of the DSP calculation unit and improving the calculation speed.
Considering the capacity limits of the vector memory, the scalar memory and the on-chip global shared cache, for a data type of w bytes the sub-block of matrix A (M_SM × K_CG), the sub-block of matrix B (K_CG × N_AM) and the sub-block of matrix C (M_SM × N_AM) must satisfy the following constraints:

M_SM * K_CG * 2 * w ≤ size(SM)
(M_CG + K_CG) * N_AM * 2 * w ≤ size(AM)
m_batch * M_CG * K_CG * w ≤ size(GSM)

Under these constraints, the maximum compute-to-memory-access ratio is solved for, determining the decision parameters; a brute-force sketch of this search is given below.
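The text does not spell out the search procedure, so the following is a brute-force sketch under stated assumptions: the candidate grid, the treatment of M_CG as equal to M_SM, and the flops-per-byte objective are ours; only the two SM/AM double-buffer constraints come from the formulas above.

    /* Brute-force block-size search (sketch). Enumerates (M_SM, K_CG,
     * N_AM) triples that fit the double-buffered SM/AM constraints and
     * keeps the one with the highest compute-to-memory-access ratio. */
    #include <stdio.h>

    #define SIZE_SM (64 * 1024)     /* repeated so the sketch stands alone */
    #define SIZE_AM (768 * 1024)

    int main(void) {
        const int w = 4;            /* bytes per element (single precision) */
        int best_m = 0, best_k = 0, best_n = 0;
        double best = 0.0;

        for (int m = 8; m <= 512; m += 8)
            for (int k = 8; k <= 512; k += 8)
                for (int n = 8; n <= 2048; n += 8) {
                    if ((long)m * k * 2 * w > SIZE_SM) continue;       /* A in SM    */
                    if ((long)(m + k) * n * 2 * w > SIZE_AM) continue; /* B, C in AM */
                    /* flops / bytes moved per block-panel step (assumed model) */
                    double ratio = (2.0 * m * n * k)
                                 / ((double)(m * k + k * n + 2.0 * m * n) * w);
                    if (ratio > best) { best = ratio; best_m = m; best_k = k; best_n = n; }
                }
        printf("M_SM=%d K_CG=%d N_AM=%d, %.2f flops/byte\n",
               best_m, best_k, best_n, best);
        return 0;
    }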
For the m_batch matrices calculated in one round, there are 8 cores in a DSP cluster. Assuming the m_batch computed above is 2, cores 0 to 3 calculate the 1st matrix and cores 4 to 7 calculate the 2nd matrix. Matrices are assigned by core number, so each core performs a single GEMM calculation and the cores together form the BGEMM calculation; this mapping is sketched below.
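A one-line rendering of that assignment (0-based indices; it assumes m_batch divides the core count evenly, as in the example):

    /* Which matrix of the current round a given core works on. With
     * 8 cores and m_batch = 2, cores 0-3 map to matrix 0 and cores
     * 4-7 to matrix 1, matching the example above. */
    int matrix_for_core(int core_id, int m_batch) {
        int cores_per_matrix = 8 / m_batch;   /* assumes m_batch divides 8 */
        return core_id / cores_per_matrix;
    }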
In step S3, starting the DSP function based on the decision parameters includes:
step S31, preparing the calculation data: DMA transfers the data of the sub-blocks of the first batch of m_batch matrices A from DDR to GSM, and carries the data of the sub-blocks of matrices B and C corresponding to each core from DDR to the first buffer of AM;
step S32, further partitioning the sub-blocks of matrix A in GSM: DMA carries the data of a sub-block of matrix A from the on-chip GSM to the first buffer of SM;
step S33, the DSP computing unit reads the data stored in the first buffers of AM and SM and calls the assembly program to compute; at the same time, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the second buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the second buffer of AM;
step S34, after the calculation of step S33 completes, DMA carries the calculation result in the first buffer of AM back to DDR, while the DSP computing unit reads the data stored in the second buffers of AM and SM and computes;
step S35, while the DSP computing unit continues computing on the data stored in the second buffers of AM and SM, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the first buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the first buffer of AM;
step S36, repeating steps S31 to S35 until the GEMM calculations of the m_batch matrices are complete;
step S37, taking m_batch matrices for calculation in each round until all N matrices are calculated.
It should be noted that the DSP function adopts a double-BUFFER mechanism during calculation: SM, AM and GSM are each divided into two parts, one part used for calculation while the other receives data, so that computation time covers memory-access time. Two buffers are set up for the sub-blocks of matrix A in the scalar storage space (SM) and two buffers for the sub-blocks of matrices B and C in the vector storage space (AM); DMA double buffering overlaps core calculation with DMA data transfer, hiding memory-access time and improving the computing efficiency of GEMM. A skeleton of this ping-pong pipeline is sketched below.
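The skeleton below shows one core's ping-pong loop following steps S31 to S37. The dma_*() and compute_block() functions are no-op stubs standing in for the real DMA programming and the assembly micro-kernel, whose interfaces the text does not give.

    /* Double-buffer (ping-pong) pipeline for one core, per steps S31-S37.
     * The stubs below only mark where the real operations would go. */
    static void dma_gsm_to_sm(int buf) { (void)buf; } /* A sub-block: GSM -> SM     */
    static void dma_ddr_to_am(int buf) { (void)buf; } /* B, C sub-blocks: DDR -> AM */
    static void dma_am_to_ddr(int buf) { (void)buf; } /* C result: AM -> DDR        */
    static void dma_wait(void)         {}             /* wait for DMA completion    */
    static void compute_block(int buf) { (void)buf; } /* assembly micro-kernel      */

    void core_pipeline(int n_blocks) {
        int cur = 0;                      /* ping-pong buffer index: 0 or 1   */
        dma_gsm_to_sm(cur);               /* S31/S32: prime the first buffers */
        dma_ddr_to_am(cur);
        dma_wait();
        for (int b = 0; b < n_blocks; b++) {
            int nxt = cur ^ 1;
            if (b + 1 < n_blocks) {       /* S33/S35: prefetch the next block */
                dma_gsm_to_sm(nxt);
                dma_ddr_to_am(nxt);
            }
            compute_block(cur);           /* compute overlaps the prefetch DMA */
            dma_am_to_ddr(cur);           /* S34: write this block's result back */
            dma_wait();                   /* conservative: real code would also
                                           * overlap the write-back with the
                                           * next computation */
            cur = nxt;
        }
    }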
The application also provides a batch processing matrix multiplication optimization realization system for heterogeneous processors, used to perform the above method. As shown in FIG. 3, the system comprises a space allocation module 1, a decision parameter calculation module 2 and a core calculation module 3.
The space allocation module 1 is used for the CPU to allocate space for the matrices on the shared DDR memory through the hthread_malloc function;
the decision parameter calculation module 2 is used for the CPU to calculate the decision parameters through a decision algorithm;
the core calculation module 3 is used to start the DSP function based on the decision parameters and realize the batch processing matrix multiplication optimization.
The application has the following beneficial effects:
1. the DSP is used for realizing a high-efficiency batch matrix multiplication (BGEMM) algorithm, so that multi-field application including deep learning can be effectively accelerated;
2. by refining the multiplication of two matrices into a group of panel-panel multiplications and further into a group of block-panel multiplications, the data each DSP computing unit needs for a calculation fits in SM and AM, reducing the memory-access latency of the DSP computing unit and improving the computing speed;
3. the DSP calculation adopts a double-BUFFER mechanism: SM, AM and GSM are each divided into two parts, one part used for calculation while the other receives data, so that computation time covers memory-access time. Two buffers are set up for the sub-blocks of matrix A in the scalar storage space (SM) and two buffers for the sub-blocks of matrices B and C in the vector storage space (AM); DMA double buffering overlaps core calculation with DMA data transfer, hiding memory-access time and improving the computing efficiency of GEMM.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (7)

1. The batch processing matrix multiplication optimization implementation method for the heterogeneous processor is characterized by comprising the following steps of:
step S1, a CPU allocates space to a matrix on a shared DDR memory through an hthread_malloc function;
step S2, a CPU calculates decision parameters through a decision algorithm, wherein the decision parameters comprise matrix block size parameters and m_batch size parameters;
step S3, starting a DSP function based on the decision parameter, wherein the DSP function adopts the following batch processing matrix multiplication:
C_i = α_i · A_i × B_i + β_i · C_i

where i = 1, 2, 3, ..., N is the index of each matrix multiplication; A_i and B_i are the input matrices and C_i is the output matrix; α_i and β_i are scalars representing the operation coefficients.
2. The method according to claim 1, wherein in step S2, the CPU calculates decision parameters by a decision algorithm, including a matrix block size parameter calculation and an m_batch size parameter calculation.
3. The method of claim 2, wherein the calculation of the matrix block size parameters comprises:
according to the principles of spatial and temporal locality and the size of the actual DSP storage space, the multiplication of two matrices is refined into a group of panel-panel multiplications and further into a group of block-panel multiplications, so that the data each DSP computing unit needs for a calculation is stored in the scalar storage space and the vector storage space;
the sub-block parameters of the matrices satisfy the following constraints, under which the maximum compute-to-memory-access ratio is solved for to determine the matrix block size parameters:

M_SM * K_CG * 2 * w ≤ size(SM)
(M_CG + K_CG) * N_AM * 2 * w ≤ size(AM)

where w is the width of the data type in bytes, M_SM × K_CG is a sub-block of matrix A, K_CG × N_AM is a sub-block of matrix B, M_SM × N_AM is a sub-block of matrix C, SM is the scalar storage space, and AM is the vector storage space.
4. The method of claim 3, wherein the m_batch size parameter calculation comprises:
allocating the matrices to be calculated according to the core numbers, with the parameter satisfying the following constraint:

m_batch * M_CG * K_CG * w ≤ size(GSM)

where GSM is the shared memory within the acceleration cluster.
5. The method according to claim 4, wherein in step S3, starting the DSP function based on the decision parameters comprises:
step S31, preparing the calculation data: DMA transfers the data of the sub-blocks of the first batch of m_batch matrices A from DDR to GSM, and carries the data of the sub-blocks of matrices B and C corresponding to each core from DDR to the first buffer of AM;
step S32, further partitioning the sub-blocks of matrix A in GSM: DMA carries the data of a sub-block of matrix A from the on-chip GSM to the first buffer of SM;
step S33, the DSP computing unit reads the data stored in the first buffers of AM and SM and calls the assembly program to compute; at the same time, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the second buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the second buffer of AM;
step S34, after the calculation of step S33 completes, DMA carries the calculation result in the first buffer of AM back to DDR, while the DSP computing unit reads the data stored in the second buffers of AM and SM and computes;
step S35, while the DSP computing unit continues computing on the data stored in the second buffers of AM and SM, DMA carries the data of the next sub-block of matrix A from the on-chip GSM to the first buffer of SM, and carries the data of the sub-blocks of matrices B and C from DDR to the first buffer of AM;
step S36, repeating steps S31 to S35 until the GEMM calculations of the m_batch matrices are complete;
step S37, taking m_batch matrices for calculation in each round until all N matrices are calculated.
6. The method of claim 1, wherein the heterogeneous processor is an MT7032 heterogeneous many-core microprocessor.
7. A batch processing matrix multiplication optimization realization system for heterogeneous processors, configured to execute the method of any one of claims 1-6, the system comprising:
the space allocation module, used for the CPU to allocate space for the matrices on the shared DDR memory through the hthread_malloc function;
the decision parameter calculation module, used for the CPU to calculate the decision parameters through a decision algorithm;
the core calculation module, used to start the DSP function based on the decision parameters and realize the batch processing matrix multiplication optimization.
Application CN202311014624.4A, filed 2023-08-14 (priority date 2023-08-14): Batch processing matrix multiplication optimization realization method and system for heterogeneous processor. Status: Pending. Publication: CN117150194A.

Priority Applications (1)

Application Number: CN202311014624.4A; Priority Date: 2023-08-14; Filing Date: 2023-08-14; Title: Batch processing matrix multiplication optimization realization method and system for heterogeneous processor
Publications (1)

Publication Number: CN117150194A; Publication Date: 2023-12-01

Family

ID=88899742



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination