CN113240570B - GEMM operation accelerator and GoogLeNet-based image processing acceleration method

Info

Publication number
CN113240570B
CN113240570B · CN202110392571.4A
Authority
CN
China
Prior art keywords
matrix
gemm
accelerator
fragmentation
gemm operation
Prior art date
Legal status: Active
Application number
CN202110392571.4A
Other languages
Chinese (zh)
Other versions
CN113240570A (en)
Inventor
羊志维 (Yang Zhiwei)
陆璐 (Lu Lu)
Current Assignee
Shenzhen Aitesi Information Technology Co., Ltd.
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202110392571.4A
Publication of CN113240570A
Application granted
Publication of CN113240570B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G06T 1/20 - Processor architectures; Processor configuration, e.g. pipelining
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks


Abstract

The invention belongs to the field of GEMM operation acceleration and relates to a GEMM operation accelerator comprising a master circuit and a slave circuit connected to it. For a batch of input matrices of unequal sizes for GEMM operations, the master circuit first judges whether the number of rows and columns of each matrix is less than or equal to 1024. If so, the matrix is dynamically sliced, the slave circuit then performs the GEMM operation on each matrix slice, and the master circuit merges the slave circuit's GEMM results before returning them to the caller. If the number of rows or columns of the matrix is greater than 1024, the result is obtained by the traditional method of cyclically calling the general matrix multiplication API provided by the platform and then returned to the caller. The GEMM operation accelerator of the invention uses dynamic slicing while accounting for both thread-level parallelism and instruction-level parallelism. The invention also provides a GoogLeNet-based image processing acceleration method.

Description

GEMM operation accelerator and GoogLeNet-based image processing acceleration method
Technical Field
The invention belongs to the field of GEMM operation acceleration, and relates to a GEMM operation accelerator and an image processing acceleration method based on GoogLeNet.
Background
BLAS (Basic Linear Algebra Subprograms) is an API standard for numerical libraries that perform basic linear algebra operations (e.g., vector and matrix multiplication). First published in 1979 and used to build larger numerical packages (e.g., LAPACK), BLAS is widely used in high-performance computing. For example, the performance of LINPACK depends largely on the performance of the BLAS subroutine DGEMM. BLAS is divided into three levels by function: Level 1, vector-vector operations; Level 2, matrix-vector operations; Level 3, matrix-matrix operations. Level 3 BLAS includes GEMM.
GEMM (General Matrix Multiplication) is a common algorithm in linear algebra, machine learning, statistics and many other fields. It has the form C = α × A × B + β × C, where A, B and C are matrices and α, β are scalars. Since matrix multiplication is ubiquitous in all types of scientific applications, GEMM is a primary target of BLAS optimization, and GEMM optimization can accelerate computation in deep learning, astrophysics, fluid dynamics and other areas.
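For illustration, the GEMM form above can be written as the following minimal reference sketch (the function and variable names are ours, not part of the patent):

```python
import numpy as np

def gemm(alpha, A, B, beta, C):
    """Reference GEMM: returns alpha * (A @ B) + beta * C."""
    assert A.shape[1] == B.shape[0], "inner dimension K must match"
    assert C.shape == (A.shape[0], B.shape[1]), "C must be M x N"
    return alpha * (A @ B) + beta * C

# Tiny usage example with M=2, N=3, K=4.
A = np.ones((2, 4)); B = np.ones((4, 3)); C = np.zeros((2, 3))
print(gemm(2.0, A, B, 0.5, C))  # every entry is 2 * 4 + 0.5 * 0 = 8
```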
MAGMA is a collection of new-generation Linear Algebra (LA) GPU acceleration libraries designed and implemented by the team that develops LAPACK and ScaLAPACK. MAGMA targets GPU-based heterogeneous architectures and supports current LA packages and standard interfaces, such as LAPACK and BLAS, allowing developers to easily migrate any LA-dependent software component. Its main advantage is that it enables applications to leverage the power of current multi-CPU (or multi-core CPU) and multi-GPU heterogeneous systems and to deliver accurate solutions at the fastest speed within given power constraints. The acceleration library provided by MAGMA contains an acceleration scheme for batched GEMM operations called vbatched.
ROCm (Radeon Open Compute platform) is an AMD GPU computing ecosystem based on a series of open-source projects, and is the first open-source software development platform for HPC (High Performance Computing) and hyperscale GPU computing. ROCm brings a new choice for GPU computing: UNIX-like, minimal, modular software development. Because the ROCm ecosystem consists of open-source projects, it can remain viable and continue to be optimized and extended. These open-source projects include machine learning frameworks (TensorFlow, PyTorch), libraries (MIOpen, rocBLAS, RCCL), programming models (HIP), and Linux kernel support.
The ROCm platform provides the hipblasSgemmBatched API for processing batched GEMM operations, but it is limited to GEMMs over a batch of matrices of the same size. For a batch of matrices of unequal sizes, the traditional method is to cyclically execute the hipblasSgemm API; MAGMA, as the mature optimization scheme currently cooperating with NVIDIA and AMD, improves on the traditional method by providing the magmablas_sgemm_vbatched API for batched GEMM over matrices of unequal sizes. However, when the matrices are small, GPU utilization remains low, resulting in low overall computational efficiency. For example, GoogLeNet has 57 convolution operations, and a common algorithm for computing a convolution is to convert it into a GEMM (i.e., C = α × A × B + β × C, where A, B and C are matrices and α, β are scalars) and then operate on that. For the converted matrices, M (the number of rows of matrices A and C), N (the number of columns of matrices B and C) and K (the number of columns of matrix A and rows of matrix B) are generally less than 1000, and M may even be less than 100. For the convolution in inception_3a/5x5_reduce, the size after conversion to GEMM is M × N × K = 16 × 784 × 192, and its performance on an MI50 GPU is less than 1% of peak, because the matrix is small and after slicing there are not enough work groups to fully occupy the GPU.
At present, batched GEMM operations over matrices of unequal sizes on platforms such as CUDA and ROCm can only be completed by cyclically calling APIs such as cublasSgemm and hipblasSgemm. Because the matrices involved in practical applications are generally small (row and column counts less than or equal to 1024), GPU utilization is low and computational efficiency is poor. MAGMA, as the mature optimization scheme currently cooperating with NVIDIA and AMD, improves on the traditional method with its vbatched method, providing batched GEMM over matrices of unequal sizes through APIs such as magmablas_sgemm_vbatched. However, when the matrices are small, GPU utilization is still low, resulting in poor overall computational efficiency.
Disclosure of Invention
Aiming at the problems of low GEMM operation efficiency and low GPU utilization for small matrices, the invention provides a GEMM operation accelerator.
The invention also provides an image processing acceleration method based on GoogLeNet.
The GEMM operation accelerator is realized by adopting the following technical scheme:
a GEMM operation accelerator comprising a master circuit and a slave circuit connected to the master circuit, wherein:
for a batch of input matrices of unequal sizes for GEMM operations, the master circuit first judges whether the number of rows and columns of each matrix is less than or equal to 1024: if so, the matrix is dynamically sliced, the slave circuit then performs the GEMM operation on each matrix slice, and the master circuit merges the slave circuit's GEMM results before returning them to the caller; if the number of rows or columns of the matrix is greater than 1024, the result is obtained by the traditional method of cyclically calling the general matrix multiplication API provided by the platform and then returned to the caller.
Preferably, the dynamic slicing process comprises: selecting, according to the scale of each matrix, the GPU architecture in use and related GPU parameters, the optimal slicing strategy for the current environment from a number of pre-established slicing strategies, and slicing the matrix accordingly.
Preferably, the pre-established slicing strategies are such that the work group size allocated to each matrix slice is consistent.
Preferably, the work group size allocated to each matrix slice is kept consistent by changing the size of the sub-slice that each work item in a single work group is responsible for computing.
Preferably, a balancing method is adopted during dynamic slicing to account for both thread-level parallelism and instruction-level parallelism.
Preferably, the balancing method comprises:

(1) calculating the optimal number of work items per single work group, N_WI:

N_WI = N_Max_WG / N_SIMD

wherein N_Max_WG is the maximum number of work items a single work group can contain, and N_SIMD is the number of SIMDs a single CU contains;

comparing N_WI with the work-item-count parameter of a single work group in each of the pre-established slicing strategies, and selecting the value closest to N_WI:

min{ abs(N_WI - T_WI_i) }

wherein T_WI_i is the number of work items per single work group in the i-th pre-established slicing strategy;

(2) screening out the feasible slicing strategies according to the principle that a matrix slice must be no larger than the input matrix, and computing for each feasible strategy the corresponding number of work groups N_WG_i:

N_WG_i = Σ_j ceil(M_j / T_M_i) × ceil(N_j / T_N_i)

wherein T_M_i and T_N_i are the slice row and column counts of the i-th slicing strategy, and M_j and N_j are the row and column counts of matrix C of the j-th GEMM;

(3) selecting the slicing strategy closest to an integer multiple of the CU count as the optimal slicing strategy:

min{ N_WG_i mod N_CU }

wherein N_CU is the total number of CUs.
Preferably, the platform comprises CUDA and ROCm.
The GoogLeNet-based image processing acceleration method is realized by adopting the following technical scheme:
an image processing acceleration method based on GoogLeNet comprises the following steps:
the image is input into GoogLeNet after a series of pre-processing, and comes to an acceptance structure after being processed by a plurality of layers;
converting convolution operation related to 4 1 multiplied by 1 convolution kernels in an initiation structure into 4 GEMM operations, inputting the 4 GEMM operations into a GEMM operation accelerator, and processing batch GEMM operations in parallel;
the GEMM operation accelerator returns an operation result to GoogLeNet;
google lenet performs subsequent image processing steps.
Preferably, for the matrices of unequal sizes input by GoogLeNet for GEMM operations, the GEMM operation accelerator judges whether the number of rows and columns of each matrix is less than or equal to 1024: if so, it performs dynamic slicing, then carries out the GEMM operation on each matrix slice in parallel, merges the GEMM results and returns them to GoogLeNet; if the number of rows or columns of the matrix is greater than 1024, the result is obtained by the traditional method of cyclically calling the general matrix multiplication API and then returned to GoogLeNet.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) Compared with the traditional method, in which batched operations over matrices of unequal sizes on native CUDA and ROCm platforms can only be solved by cyclic brute force, and with the vbatched method of the new-generation Linear Algebra (LA) GPU acceleration library MAGMA, the GEMM operation accelerator of the invention uses dynamic slicing while accounting for both Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP), and therefore takes less time when the row and column counts of the matrices are less than or equal to 1024.
(2) The GEMM operation accelerator can be designed based on the ROCm platform and packaged as an accelerator for GPU platforms such as CUDA and ROCm to call. Meanwhile, since GEMM matrix multiplication is ubiquitous in all types of scientific applications, the accelerator can be widely applied in many scenarios, such as image processing, deep learning, astrophysics and fluid dynamics.
(3) The invention converts the convolutions of the four 1 × 1 convolution kernels involved in GoogLeNet's Inception structure into GEMMs and then accelerates the computation with the GEMM operation accelerator instead of the default method, thereby reducing the computation time of GoogLeNet in applications such as image recognition and image classification.
Drawings
FIG. 1 is a flow diagram of the operation of the GEMM operation accelerator in one embodiment;
FIG. 2 is a diagram of the operation optimization of the GoogLeNet Inception structure in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Example 1
A GEMM operation accelerator comprises a master circuit and a slave circuit connected to the master circuit, wherein: for a batch of input matrices of unequal sizes for GEMM operations, the master circuit first judges whether the number of rows and columns of each matrix is less than or equal to 1024: if so, the matrix is dynamically sliced, the slave circuit then performs the GEMM operation on each matrix slice, and the master circuit merges the slave circuit's GEMM results before returning them to the caller; if the number of rows or columns of the matrix is greater than 1024, the result is obtained by the traditional method of cyclically calling the general matrix multiplication API provided by the platform and then returned to the caller. The work flow of the GEMM operation accelerator is shown in FIG. 1.
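A rough host-side sketch of this dispatch rule follows (our illustration under the patent's 1024 threshold; the function names are hypothetical, and in a real implementation the "looped" path would cyclically call the vendor GEMM API, e.g. hipblasSgemm):

```python
import numpy as np

THRESHOLD = 1024  # the patent's cut-off on matrix row/column counts

def dispatch(problems):
    """Classify a batch of variable-size GEMM problems into the two paths.

    problems: list of (A, B, C) ndarrays for C = alpha*A@B + beta*C.
    Returns (sliced, looped): indices for the dynamic-slicing path and for
    the traditional path that loops the platform GEMM API.
    """
    sliced, looped = [], []
    for i, (A, B, C) in enumerate(problems):
        M, K = A.shape
        N = B.shape[1]
        if M <= THRESHOLD and N <= THRESHOLD and K <= THRESHOLD:
            sliced.append(i)   # master circuit slices; slave circuit runs the tiles
        else:
            looped.append(i)   # cyclically call the vendor GEMM API
    return sliced, looped

# Example: one small GEMM and one large one.
small = (np.ones((16, 192)), np.ones((192, 784)), np.zeros((16, 784)))
large = (np.ones((2048, 64)), np.ones((64, 64)), np.zeros((2048, 64)))
print(dispatch([small, large]))  # -> ([0], [1])
```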
The following takes the acceleration of the GoogLeNet Inception structure by the GEMM operation accelerator of the invention as an example.
GoogLeNet is the champion model of the ILSVRC 2014 visual recognition competition (see: Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1-9). It saves computing resources to the greatest extent through parameter reduction and was the first to propose the Inception structure, which uses multi-layer perceptrons to replace the generalized linear structure of the traditional convolutional neural network, increases the width and depth of the network, and uses a locally optimal sparse structure in place of the full connection of the original convolutional neural network, avoiding redundancy to the greatest extent. The GoogLeNet convolutional neural network consists of an input layer, several convolutional layers, several sub-sampling layers and an output layer, for 22 layers in all. Although the number of layers is large and its ability to abstract sample data is strong, the number of parameters is very small (only about 5 MB), which helps sample training converge quickly; the network also has 3 loss values and can produce outputs at different layers (see: Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2818-2826).
For the Inception structure of GoogLeNet, as shown in FIG. 2, the convolution kernels in the Inception structure include four 1 × 1 convolution kernels, one 3 × 3 convolution kernel and one 5 × 5 convolution kernel. After the convolution operations involving the four 1 × 1 kernels in the Inception structure are converted into 4 GEMM operations, the default is to cyclically call the general matrix multiplication API provided by platforms such as CUDA and ROCm to solve them; this traditional method makes poor use of the GPU, so its efficiency is low. In the invention, the matrices involved are instead input into the GEMM operation accelerator, which processes the batched GEMM operations in parallel in place of the original solving method, so as to accelerate GoogLeNet in applications such as image recognition and image classification.
For the batch of matrices of unequal sizes input by GoogLeNet for GEMM operations, the accelerator judges whether the number of rows and columns of each matrix is less than or equal to 1024. If so, it performs dynamic slicing: according to the scale of each matrix, the GPU architecture in use and related parameters, it selects the optimal strategy for the current environment from a number of predetermined slicing strategies and slices the matrices, then performs the GEMM operation on each matrix slice in parallel and returns the merged result to GoogLeNet. If the number of rows or columns of the matrix is greater than 1024, the result is obtained by the traditional method of cyclically calling the general matrix multiplication API provided by platforms such as CUDA and ROCm, and then returned to GoogLeNet.
In a preferred embodiment, the slicing strategies are formulated as shown in Table 1:
TABLE 1
T_M   T_N   T_K   Work Items/Work Group
16    16    8     128
32    32    8     128
64    64    8     128
128   64    8     128
64    128   8     128
16    16    8     256
32    32    8     256
64    64    8     256
128   64    8     256
64    128   8     256
In table 1: t _ M, T _ N and T _ K are respectively the row number of a matrix slice A and a matrix slice C, the column number of the matrix slice B and the matrix slice C, the column number of the matrix slice A and the row number of the matrix slice B in GEMM (namely C = the form of. Alpha. × A × B +. Beta. × C, A, B and C are matrix slices after fragmentation, and. Alpha.,. Beta. Is scalar) operation of the matrix slice A and the matrix slice C, and Work Items/Work Group represents the number of Work Items in a single Work Group.
Compared with the traditional method, in which batched operations over matrices of unequal sizes on native CUDA and ROCm platforms can only be solved by cyclic brute force, and with the vbatched method of the new-generation Linear Algebra (LA) GPU acceleration library MAGMA, the method of the invention uses dynamic slicing while accounting for both Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP), so the final batched GEMM performance is greatly improved when the row and column counts of the matrices are less than or equal to 1024.
For the pre-designed series of slicing strategies, in order to avoid idle threads when matrices of different scales participate in the batched GEMM operation, the strategies keep the work group size allocated to each matrix slice consistent, which is achieved by changing the size of the sub-slice that each work item in a single work group is responsible for computing.
In one embodiment, 10 strategies are established for dynamic selection according to the matrix slice size and the number of work items contained in a single work group (as in Table 1). The influence of different work group sizes on the sub-slice size is also considered: for example, for a 16 × 16 matrix slice and a work group containing 128 work items, the sub-slice size is (16 × 16) / 128 = 2, so the sub-slice is 2 × 1.
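A small sketch of this sub-slice sizing rule, using only the worked figures given in the text (the function name is ours):

```python
def subslice_elems(t_m, t_n, work_items):
    """Elements of the C slice each work item computes; must divide evenly."""
    assert (t_m * t_n) % work_items == 0, "slice must split evenly over work items"
    return (t_m * t_n) // work_items

# Worked example from the text: a 16x16 slice with 128 work items gives
# (16*16)/128 = 2 elements per work item, i.e. a 2x1 sub-slice; with
# 256 work items each work item computes a single element.
assert subslice_elems(16, 16, 128) == 2
assert subslice_elems(16, 16, 256) == 1
```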
When selecting a slicing strategy, both thread-level parallelism and instruction-level parallelism must be considered: a work group with more work items (e.g., one containing 256 work items) improves thread-level parallelism, but the resulting smaller sub-slices reduce instruction-level parallelism, so the two must be balanced.
In a preferred embodiment, dynamic slicing uses a balancing method that accounts for both thread-level parallelism and instruction-level parallelism. The balancing method is as follows:
(1) For each GEMM operation, first calculate the optimal number of work items per single work group, N_WI:

N_WI = N_Max_WG / N_SIMD

where N_Max_WG is the maximum number of work items a single work group can contain, and N_SIMD is the number of SIMDs (i.e., vector processing units) contained in a single CU (i.e., compute unit).

Select the closest value by comparing with the work-item-count parameter of a single work group in the existing slicing strategies:

min{ abs(N_WI - T_WI_i) }

where T_WI_i is the number of work items per single work group in the i-th slicing strategy.

(2) Then screen out the feasible slicing strategies according to the principle that a matrix slice must be no larger than the input matrix. For each feasible strategy, compute the corresponding number of work groups N_WG_i:

N_WG_i = Σ_j ceil(M_j / T_M_i) × ceil(N_j / T_N_i)

where T_M_i and T_N_i are the slice row and column counts of the i-th slicing strategy, and M_j and N_j are the row and column counts of matrix C of the j-th GEMM.

(3) After the work group counts of the strategies are computed, select the strategy closest to an integer multiple of the CU count as the optimal slicing strategy:

min{ N_WG_i mod N_CU }

where N_CU is the total number of CUs.
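The three steps can be sketched as follows. This is our reconstruction under stated assumptions: the N_WI formula is taken as N_Max_WG / N_SIMD from the definitions above (the patent's original expression appears only as an image placeholder), and the Table 1 strategies are encoded as plain data:

```python
import math

# Table 1 as (T_M, T_N, T_K, work items per work group).
STRATEGIES = [
    (16, 16, 8, 128), (32, 32, 8, 128), (64, 64, 8, 128),
    (128, 64, 8, 128), (64, 128, 8, 128),
    (16, 16, 8, 256), (32, 32, 8, 256), (64, 64, 8, 256),
    (128, 64, 8, 256), (64, 128, 8, 256),
]

def pick_strategy(gemms, n_max_wg, n_simd, n_cu):
    """gemms: list of (M, N) shapes of each output matrix C.
    n_max_wg: max work items per work group; n_simd: SIMDs per CU; n_cu: CUs.
    """
    # Step 1: ideal work items per work group (assumed N_Max_WG / N_SIMD),
    # then keep the strategies whose work-group size is closest to it.
    n_wi = n_max_wg / n_simd
    best_wi = min((s[3] for s in STRATEGIES), key=lambda w: abs(n_wi - w))
    candidates = [s for s in STRATEGIES if s[3] == best_wi]
    # Step 2: a slice must not exceed any input matrix.
    feasible = [s for s in candidates
                if all(s[0] <= M and s[1] <= N for M, N in gemms)]
    if not feasible:            # fall back rather than fail on tiny inputs
        feasible = candidates
    # Total work groups the strategy generates over the whole batch.
    def n_wg(s):
        return sum(math.ceil(M / s[0]) * math.ceil(N / s[1]) for M, N in gemms)
    # Step 3: prefer the count nearest an integer multiple of the CU count.
    return min(feasible, key=lambda s: n_wg(s) % n_cu)

# Example with illustrative GPU parameters (64 CUs, 4 SIMDs per CU,
# at most 256 work items per work group).
print(pick_strategy([(16, 784), (64, 784), (128, 784)], 256, 4, 64))
```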
For the GEMM operation, matrix C is partitioned into a number of matrix slices of size X × Y; each matrix slice is obtained by operating on the corresponding row data of matrix A (X rows) and the corresponding column data of matrix B (Y columns). This data is too large to fit at once in a single VGPR (vector general-purpose register) file and the LDS (local data share) of the GPU.
Therefore, the corresponding matrix slices of matrices A and B are first moved from system memory into the LDS of the GPU; in the LDS, the slices of matrices A and B are divided into several sub-slices according to the number of work items per single work group in the slicing strategy; the corresponding sub-slices are then loaded into VGPRs for the GEMM computation; finally, the sub-slice results are combined into the final result, so as to fully exploit thread-level parallelism.
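This data flow can be mimicked on the CPU with a standard blocked GEMM; the sketch below is only an analogy for the LDS/VGPR staging described above, not the device kernel itself:

```python
import numpy as np

def tiled_gemm(A, B, C, alpha=1.0, beta=1.0, t_m=16, t_n=16, t_k=8):
    """CPU sketch of the sliced GEMM data flow: the outer loops mirror the
    per-work-group slices staged in LDS; the innermost multiply mirrors the
    per-work-item sub-slices held in registers (VGPRs)."""
    M, K = A.shape
    N = B.shape[1]
    out = beta * C.copy()
    for i in range(0, M, t_m):          # one work group per (i, j) slice of C
        for j in range(0, N, t_n):
            acc = np.zeros((min(t_m, M - i), min(t_n, N - j)))
            for k in range(0, K, t_k):  # stream A/B sub-blocks through "LDS"
                a = A[i:i + t_m, k:k + t_k]
                b = B[k:k + t_k, j:j + t_n]
                acc += a @ b            # per-work-item sub-slice products
            out[i:i + t_m, j:j + t_n] += alpha * acc
    return out

# Quick check against the direct formula.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30)); B = rng.standard_normal((30, 40))
C = rng.standard_normal((50, 40))
assert np.allclose(tiled_gemm(A, B, C, 2.0, 0.5), 2.0 * A @ B + 0.5 * C)
```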
The method comprises the following specific steps:
selecting a fragmentation strategy
As shown in FIG. 1, GoogLeNet inputs a batch of matrices of variable sizes for computation. Since the essence of the invention is to raise GPU utilization as much as possible to improve efficiency when the GPU cannot be fully occupied, the GEMM operation accelerator of the invention is, compared with the traditional method and MAGMA's vbatched method, better suited to operations over small matrices; and since GoogLeNet in practice often involves matrix operations whose row and column counts are less than or equal to 1024, the accelerator can speed up GoogLeNet in scenarios such as image recognition and image classification. The GEMM operation accelerator first judges whether the row and column counts of the batch of paired matrices input by GoogLeNet are less than or equal to 1024; matrices larger than 1024 are computed by the traditional method, and matrices no larger than 1024 are computed by the optimized method.
In order to select the optimal strategy from the pre-established slicing strategies, the current GPU architecture and related parameters must first be obtained, including: the maximum number of work items a single work group can contain, the number of SIMDs per CU, and the total number of CUs.
First calculate the optimal number of work items per single work group, N_WI:

N_WI = N_Max_WG / N_SIMD

where N_Max_WG is the maximum number of work items a single work group can contain, and N_SIMD is the number of SIMDs a single CU contains. Select the closest value by comparing with the work-item-count parameter of a single work group in the slicing strategies:

min{ abs(N_WI - T_WI_i) }

where T_WI_i is the number of work items per single work group in the i-th slicing strategy.

Then screen out the feasible slicing strategies according to the principle that a matrix slice must be no larger than the input matrix, and compute for each feasible strategy the corresponding number of work groups N_WG_i:

N_WG_i = Σ_j ceil(M_j / T_M_i) × ceil(N_j / T_N_i)

where T_M_i and T_N_i are the slice row and column counts of the i-th slicing strategy, and M_j and N_j are the row and column counts of matrix C of the j-th GEMM.

After the work group counts of the strategies are computed, select the strategy closest to an integer multiple of the CU count as the optimal slicing strategy:

min{ N_WG_i mod N_CU }

where N_CU is the total number of CUs.
(II) Slice computation
The paired matrices stored in system memory are sliced according to the selected slicing strategy. To improve transfer efficiency, the resulting matrix slices are staged one by one into the LDS (local data share) of each CU (compute unit); the slices of matrices A and B in the LDS are divided into several sub-slices according to the number of work items per single work group in the strategy; the corresponding sub-slices are transferred one by one into the VGPRs (vector general-purpose registers) of each SIMD (vector processing unit); and the GEMM computation is then performed using the SIMDs.
(III) Combining the computation results
After each sub-slice finishes its computation, the result is stored back to its original LDS address; after an entire matrix slice finishes, the result is stored back to its original system memory address; and after all matrix slices finish, the complete GEMM results are returned to GoogLeNet.
Example 2
An image processing acceleration method based on GoogLeNet comprises the following steps:
After a series of pre-processing steps, the image is input into GoogLeNet, and the data reaches an Inception structure after several layers. The convolution kernels in the Inception structure include four 1 × 1 kernels, one 3 × 3 kernel and one 5 × 5 kernel. Under CUDA or ROCm, the usual method of computing a convolution is to convert it into a GEMM (i.e., the form C = α × A × B + β × C, where A, B and C are matrices and α, β are scalars) and then operate on that. For the converted matrices, M (rows of A and C), N (columns of B and C) and K (columns of A, rows of B) are generally less than 1000, and M may even be less than 100; for example, the convolution in inception_3a/5x5_reduce has size M × N × K = 16 × 784 × 192 after conversion to GEMM, and its performance on an MI50 GPU is less than 1% of peak, because the matrix is small and after slicing there are not enough work groups to fully occupy the GPU. Under CUDA or ROCm, the GEMM subroutines are called serially to solve the matrix operations involved in the Inception structure, whereas the GEMM operation accelerator of the invention processes the matrix operations of the four 1 × 1 convolution kernels together in a unified way, improving GPU utilization and finally achieving the effect of accelerating image processing.
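The dimension arithmetic in this paragraph follows the standard im2col lowering of a convolution to GEMM; the sketch below (names ours, not the patent's code) reproduces the inception_3a/5x5_reduce figures:

```python
def conv_as_gemm_dims(out_channels, out_h, out_w, in_channels, k_h, k_w):
    """GEMM sizes for a convolution lowered via im2col:
    M = output channels, N = output pixels, K = in_channels * kernel area."""
    M = out_channels
    N = out_h * out_w
    K = in_channels * k_h * k_w
    return M, N, K

# The text's inception_3a/5x5_reduce example: 16 output channels, 28x28
# output, 192 input channels, 1x1 kernel -> M x N x K = 16 x 784 x 192.
assert conv_as_gemm_dims(16, 28, 28, 192, 1, 1) == (16, 784, 192)
```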
The above examples are preferred embodiments of the present invention that further detail its objects, technical solutions and advantages, but the embodiments of the present invention are not limited thereto; any other modification, equivalent substitution or improvement made without departing from the spirit and principle of the present invention shall be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A GEMM operation accelerator comprising a master circuit and a slave circuit connected to the master circuit, wherein:

for a batch of input matrices of unequal sizes for GEMM operations, the master circuit first judges whether the number of rows and columns of each matrix is less than or equal to 1024: if so, the matrix is dynamically sliced, the slave circuit then performs the GEMM operation on each matrix slice, and the master circuit merges the slave circuit's GEMM results before returning them to the caller; if the number of rows or columns of the matrix is greater than 1024, the result is obtained by the traditional method of cyclically calling the general matrix multiplication API provided by the platform and then returned to the caller;

during dynamic slicing, a balancing method is adopted to account for both thread-level parallelism and instruction-level parallelism, comprising:

(1) calculating the optimal number of work items per single work group, N_WI:

N_WI = N_Max_WG / N_SIMD

wherein N_Max_WG is the maximum number of work items a single work group can contain, and N_SIMD is the number of SIMDs a single CU contains;

comparing N_WI with the work-item-count parameter of a single work group in each of a number of pre-established slicing strategies, and selecting the value closest to N_WI:

min{ abs(N_WI - T_WI_i) }

wherein T_WI_i is the number of work items per single work group in the i-th pre-established slicing strategy;

(2) screening out the feasible slicing strategies according to the principle that a matrix slice must be no larger than the input matrix, and computing for each feasible strategy the corresponding number of work groups N_WG_i:

N_WG_i = Σ_j ceil(M_j / T_M_i) × ceil(N_j / T_N_i)

wherein T_M_i and T_N_i are the slice row and column counts of the i-th slicing strategy, and M_j and N_j are the row and column counts of matrix C of the j-th GEMM;

(3) selecting the slicing strategy closest to an integer multiple of the CU count as the optimal slicing strategy:

min{ N_WG_i mod N_CU }

wherein N_CU is the total number of CUs.
2. The GEMM operation accelerator of claim 1, wherein the dynamic slicing process comprises: selecting, according to the scale of each matrix, the GPU architecture in use and related GPU parameters, the optimal slicing strategy for the current environment from a number of pre-established slicing strategies, and slicing the matrix accordingly.
3. The GEMM operation accelerator of claim 2, wherein the pre-established slicing strategies are such that the work group size allocated to each matrix slice is consistent.
4. The GEMM operation accelerator of claim 3, wherein the work group size allocated to each matrix slice is kept consistent by changing the size of the sub-slice that each work item in a single work group is responsible for computing.
5. The GEMM operation accelerator of claim 1, wherein the platform comprises CUDA and ROCm.
6. A GoogLeNet-based image processing acceleration method, implemented with the GEMM operation accelerator of any one of claims 1-4, comprising:
the image is input into GoogLeNet after a series of pre-processing steps and, after being processed by several layers, reaches an Inception structure;
the convolution operations involving the four 1 × 1 convolution kernels in the Inception structure are converted into 4 GEMM operations and input into the GEMM operation accelerator, which processes the batched GEMM operations in parallel;
the GEMM operation accelerator returns the operation results to GoogLeNet;
GoogLeNet performs the subsequent image processing steps.
7. The image processing acceleration method of claim 6, wherein, for the matrices of unequal sizes input by GoogLeNet for GEMM operations, the GEMM operation accelerator judges whether the number of rows and columns of each matrix is less than or equal to 1024: if so, it performs dynamic slicing, then carries out the GEMM operation on each matrix slice in parallel, merges the GEMM results and returns them to GoogLeNet; if the number of rows or columns of the matrix is greater than 1024, the result is obtained by the traditional method of cyclically calling the general matrix multiplication API and then returned to GoogLeNet.
8. The image processing acceleration method of claim 7, wherein the dynamic slicing process comprises: selecting, according to the scale of each matrix, the GPU architecture in use and related GPU parameters, the optimal slicing strategy for the current environment from a number of pre-established slicing strategies, and slicing the matrix accordingly.
CN202110392571.4A 2021-04-13 2021-04-13 GEMM operation accelerator and GoogLeNet-based image processing acceleration method Active CN113240570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110392571.4A CN113240570B (en) 2021-04-13 2021-04-13 GEMM operation accelerator and GoogLeNet-based image processing acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110392571.4A CN113240570B (en) 2021-04-13 2021-04-13 GEMM operation accelerator and GoogLeNet-based image processing acceleration method

Publications (2)

Publication Number Publication Date
CN113240570A CN113240570A (en) 2021-08-10
CN113240570B true CN113240570B (en) 2023-01-06

Family

ID=77128680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110392571.4A Active CN113240570B (en) 2021-04-13 2021-04-13 GEMM operation accelerator and GoogLeNet-based image processing acceleration method

Country Status (1)

Country Link
CN (1) CN113240570B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245751A (en) * 2017-08-31 2019-09-17 北京中科寒武纪科技有限公司 GEMM operation method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808309B (en) * 2016-03-08 2019-04-05 中国科学院软件研究所 A kind of high-performance implementation method of the basic linear algebra library BLAS three-level function GEMM based on Shen prestige platform
US10073815B2 (en) * 2016-05-31 2018-09-11 Palo Alto Research Cener Incorporated System and method for speeding up general matrix-matrix multiplication on the GPU
US10657442B2 (en) * 2018-04-19 2020-05-19 International Business Machines Corporation Deep learning accelerator architecture with chunking GEMM
US11580386B2 (en) * 2019-03-18 2023-02-14 Electronics And Telecommunications Research Institute Convolutional layer acceleration unit, embedded system having the same, and method for operating the embedded system
CN110246078B (en) * 2019-05-31 2020-11-03 北京航空航天大学 Image processing method and device based on embedded GPU and convolution calculation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245751A (en) * 2017-08-31 2019-09-17 北京中科寒武纪科技有限公司 GEMM operation method and device

Also Published As

Publication number Publication date
CN113240570A (en) 2021-08-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230915

Address after: 518000, Zone 2111, Area A, 2nd Floor, Building R2-B, Gaoxin Industrial Village, No. 020 Gaoxin South Seventh Road, Gaoxin Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen Xiangruilai Technology Co.,Ltd.

Address before: 510640, No. 381 Wushan Road, Tianhe District, Guangzhou, Guangdong

Patentee before: SOUTH CHINA UNIVERSITY OF TECHNOLOGY

TR01 Transfer of patent right

Effective date of registration: 20231027

Address after: B508, Unit 1, Building 6, Shenzhen Software Park, No. 2 Gaoxin Middle Road, Maling Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province, 518000

Patentee after: Shenzhen Aitesi Information Technology Co.,Ltd.

Address before: 518000, Zone 2111, Area A, 2nd Floor, Building R2-B, Gaoxin Industrial Village, No. 020 Gaoxin South Seventh Road, Gaoxin Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen Xiangruilai Technology Co.,Ltd.