Background
BLAS (Basic Linear Algebra Subprograms) is an API standard that specifies the interface of numerical libraries for basic linear algebra operations (e.g., vector and matrix multiplication). First released in 1979 and used to build larger numerical packages (e.g., LAPACK), BLAS is widely used in high-performance computing. For example, the performance of LINPACK depends largely on the performance of the DGEMM subroutine in BLAS. BLAS is divided into three levels by function: Level 1 covers vector-vector operations, Level 2 covers matrix-vector operations, and Level 3 covers matrix-matrix operations. GEMM belongs to Level 3.
GEMM (General Matrix Multiplication) is a common operation in linear algebra, machine learning, statistics and many other fields. It has the form C = α × A × B + β × C, where A, B and C are matrices and α and β are scalars. Because matrix multiplication is ubiquitous in scientific applications, GEMM is a primary target of BLAS optimization. Optimizing GEMM can accelerate computation in deep learning, astrophysics, fluid dynamics and similar fields.
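As a point of reference, the GEMM form C = α × A × B + β × C can be written out directly. The following is a minimal, unoptimized sketch in plain Python with matrices stored as lists of rows; the function name `gemm` and the sample values are illustrative assumptions and do not correspond to any BLAS implementation.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (A x B) + beta * C for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    # A must be M x K, B must be K x N, C must be M x N.
    if any(len(row) != k for row in a) or len(c) != m or any(len(row) != n for row in c):
        raise ValueError("incompatible matrix shapes")
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


# Illustrative values only.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(2.0, a, b, 0.5, c)
```

Here M, N and K are the sizes referred to throughout the description: A is M × K, B is K × N and C is M × N.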
MAGMA is a collection of next-generation GPU-accelerated linear algebra (LA) libraries designed and implemented by the team that developed LAPACK and ScaLAPACK. MAGMA targets GPU-based heterogeneous architectures and supports current LA packages and standard interfaces such as LAPACK and BLAS, so that developers can easily port any LA-dependent software component. Its main advantage is that it lets applications exploit current multi-CPU (or multi-core CPU) and multi-GPU heterogeneous systems and deliver accurate solutions as fast as possible within given power constraints. The MAGMA library includes an acceleration scheme for batched GEMM operations called vbatch.
ROCm (Radeon Open Compute platform) is AMD's GPU computing ecosystem, built on a series of open source projects, and is the first open source software development platform for HPC (High Performance Computing) and hyperscale GPU computing. ROCm brings a new option to GPU computing: UNIX-like, minimalist, modular software development. Because the ROCm ecosystem consists of open source projects, it can remain viable and continue to be optimized and extended. These projects include machine learning frameworks (TensorFlow, PyTorch), libraries (MIOpen, BLAS, RCCL), a programming model (HIP), and Linux kernel support.
The ROCm platform provides the hipblas_Sgemm_batched API for batched GEMM operations, but it is limited to batches of matrices of the same size. For a batch of matrices of varying sizes, the traditional approach is to call the hipblas_Sgemm API in a loop. MAGMA, a mature optimization scheme that currently works with both NVIDIA and AMD hardware, improves on the traditional approach by providing the magmablas_sgemm_vbatched API for batched GEMM operations on matrices of varying sizes. However, when the matrices are small, GPU utilization remains low, and overall computational efficiency is therefore low. For example, GoogLeNet has 57 convolution operations, and a common way to compute a convolution is to convert it into a GEMM (i.e., C = α × A × B + β × C, where A, B and C are matrices and α and β are scalars). For the converted matrices, M (the number of rows of A and C), N (the number of columns of B and C) and K (the number of columns of A and the number of rows of B) are generally less than 1000, and M can even be less than 100. For the convolution in inception_3a/5x5_reduce, the size after conversion into GEMM is M × N × K = 16 × 784 × 192, and performance on an MI50 GPU is less than 1% of peak, because the matrix is small and, after slicing, there are not enough work groups to fully occupy the GPU.
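The shortage of work groups for a small matrix can be checked with simple arithmetic. The sketch below assumes that one work group is launched per output slice of C and that the MI50 has 60 compute units (a figure assumed here, not stated in the text); the function name and the two slice sizes are hypothetical choices for illustration.

```python
import math


def work_groups_for(m, n, tile_m, tile_n):
    """Number of output slices (one work group each) needed to cover an M x N matrix C."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)


NUM_CU = 60  # assumed compute-unit count of an MI50; not given in the text

# The 16 x 784 output matrix of the inception_3a/5x5_reduce example.
wg_large_tile = work_groups_for(16, 784, 64, 64)  # 1 row of slices x 13 columns of slices
wg_small_tile = work_groups_for(16, 784, 16, 16)  # 1 row of slices x 49 columns of slices
undersubscribed = wg_large_tile < NUM_CU and wg_small_tile < NUM_CU
```

Under these assumptions a single GEMM of this size yields fewer work groups than there are compute units for either slice size, which is consistent with the low utilization described above and motivates processing several small GEMMs as one batch.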
At present, on platforms such as CUDA and ROCm, batched GEMM operations on matrices of varying sizes can only be completed by calling APIs such as cublas_Sgemm and hipblas_Sgemm in a loop. Because the matrices involved in practical applications are generally small (row and column counts less than or equal to 1024), GPU utilization is low and efficiency is poor. MAGMA, a mature optimization scheme that currently works with both NVIDIA and AMD hardware, offers the vbatch method, which improves on the traditional approach and handles batched GEMM operations on matrices of varying sizes through APIs such as magmablas_sgemm_vbatched. However, when the matrices are small, GPU utilization is still low, resulting in poor overall efficiency.
Disclosure of Invention
To address the low GEMM efficiency and low GPU utilization that arise when matrices are small, the invention provides a GEMM operation accelerator.
The invention also provides an image processing acceleration method based on GoogLeNet.
The GEMM operation accelerator is implemented by the following technical solution:
a GEMM operation accelerator comprising a master circuit and a slave circuit connected to the master circuit, wherein:
for a batch of input matrices of different sizes on which GEMM operations are to be performed, the master circuit first determines whether the number of rows and the number of columns of each matrix are less than or equal to 1024: if both are less than or equal to 1024, the master circuit dynamically slices the matrix, the slave circuit performs the GEMM operation on each matrix slice, and the master circuit merges the slave circuit's results and returns them to the caller; if the number of rows or columns is greater than 1024, the result is obtained by the traditional method of calling the general matrix multiplication API provided by the platform in a loop and is then returned to the caller.
Preferably, dynamic slicing comprises: selecting, from a plurality of pre-established slicing strategies, the strategy that is optimal in the current environment according to the size of each matrix, the GPU architecture in use and the relevant GPU parameters, and slicing the matrix accordingly.
Preferably, the pre-established slicing strategies keep the work group size allocated to each matrix slice consistent.
Preferably, the work group size allocated to each matrix slice is kept consistent by changing the size of the sub-slice that each work item in a single work group is responsible for computing.
Preferably, dynamic slicing uses a balancing method that takes both thread-level parallelism and instruction-level parallelism into account.
Preferably, the balancing method comprises:
(1) Calculating the optimal number of work items per work group, $N_{WI}$:
where $N_{Max\_WG}$ is the maximum number of work items that a single work group can contain, and $N_{SIMD}$ is the number of SIMDs in a single CU.
$N_{WI}$ is compared with the work-items-per-work-group parameter of each pre-established slicing strategy, and the value closest to $N_{WI}$ is selected:
$\min_i \left| N_{WI} - T_{WI_i} \right|$
where $T_{WI_i}$ is the number of work items in a single work group under the $i$-th pre-established slicing strategy.
(2) Screening out the feasible slicing strategies according to the principle that the matrix slice must be smaller than the input matrix, and calculating for each feasible strategy the corresponding number of work groups $N_{WG_i}$:
where $T_{M_i}$ and $T_{N_i}$ are the number of rows and columns of the matrix slice under the $i$-th slicing strategy, and $M_j$ and $N_j$ are the number of rows and columns of matrix C in the $j$-th GEMM.
(3) Selecting as the optimal slicing strategy the one whose work group count is closest to an integer multiple of the CU count:
$\min_i \left( N_{WG_i} \bmod N_{CU} \right)$
where $N_{CU}$ is the total number of CUs.
Preferably, the platforms include CUDA and ROCm.
The GoogLeNet-based image processing acceleration method is implemented by the following technical solution:
an image processing acceleration method based on GoogLeNet, comprising the following steps:
after a series of pre-processing steps, the image is input into GoogLeNet and, after being processed by several layers, reaches an Inception structure;
the convolution operations involving the four 1 × 1 convolution kernels in the Inception structure are converted into four GEMM operations, which are input into the GEMM operation accelerator and processed in parallel as a batch;
the GEMM operation accelerator returns the result to GoogLeNet;
GoogLeNet performs the subsequent image processing steps.
Preferably, for the matrices of different sizes input by GoogLeNet for GEMM operations, the GEMM operation accelerator determines whether the number of rows and the number of columns of each matrix are less than or equal to 1024: if both are, the matrix is dynamically sliced, the GEMM operation is performed on each matrix slice in parallel, and the merged result is returned to GoogLeNet; if the number of rows or columns is greater than 1024, the result is obtained by the traditional method of calling the general matrix multiplication API in a loop and is then returned to GoogLeNet.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) On native CUDA and ROCm platforms, batched operations on matrices of unequal sizes can only be solved by brute-force looping (the traditional method); the alternative is the vbatch method of the next-generation linear algebra (LA) GPU acceleration library MAGMA. Compared with both, the GEMM operation accelerator of the invention uses dynamic slicing and takes thread-level parallelism (TLP) and instruction-level parallelism (ILP) into account at the same time, so that it takes less time when the row and column counts of the matrices are less than or equal to 1024.
(2) The GEMM acceleration can be designed on the ROCm platform and packaged as an accelerator that can be called from GPU platforms such as CUDA and ROCm. Because GEMM is ubiquitous in scientific applications, the GEMM operation accelerator can be applied widely, for example in image processing, deep learning, astrophysics and fluid dynamics.
(3) The invention converts the convolutions involving the four 1 × 1 convolution kernels in the Inception structure of GoogLeNet into GEMM operations and accelerates them with the GEMM operation accelerator instead of the default method, thereby reducing the running time of GoogLeNet in applications such as image recognition and image classification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Example 1
A GEMM operation accelerator comprising a master circuit and a slave circuit connected to the master circuit, wherein: for a batch of input matrices of different sizes on which GEMM operations are to be performed, the master circuit first determines whether the number of rows and the number of columns of each matrix are less than or equal to 1024: if both are less than or equal to 1024, the master circuit dynamically slices the matrix, the slave circuit performs the GEMM operation on each matrix slice, and the master circuit merges the slave circuit's results and returns them to the caller; if the number of rows or columns is greater than 1024, the result is obtained by the traditional method of calling the general matrix multiplication API provided by the platform in a loop and is then returned to the caller. The workflow of the GEMM operation accelerator is shown in fig. 1.
The following description takes as an example the use of the GEMM operation accelerator of the invention to accelerate the Inception structure of GoogLeNet.
GoogLeNet is the model that won the ILSVRC 2014 computer vision competition (see: Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1-9). It saves computing resources as far as possible by reducing the number of parameters, and it introduced the Inception structure. This structure uses multi-layer perceptrons in place of the generalized linear structure of a traditional convolutional neural network, increases the width and depth of the network, and uses a locally optimal sparse structure in place of the full connection mode of the original convolutional neural network, thereby avoiding redundancy as far as possible. The GoogLeNet convolutional neural network consists of an input layer, several convolutional layers, several sub-sampling layers and an output layer, 22 layers in total. Because the network is deep, it has strong abstraction capability for sample data; because it has few parameters (only 5MB), it trains readily and converges quickly. The network also has 3 loss values and can produce outputs at different layers (see: Szegedy C, Vanhoucke V, Ioffe S, et al.).
As shown in fig. 2, the Inception structure of GoogLeNet contains four 1 × 1 convolution kernels, one 3 × 3 convolution kernel and one 5 × 5 convolution kernel. After the convolution operations involving the four 1 × 1 convolution kernels in the Inception structure are converted into four GEMM operations, the default is to solve them by calling, in a loop, the general matrix multiplication API provided by platforms such as CUDA and ROCm. This traditional method has low GPU utilization, so its efficiency is poor. In the invention, the matrices involved are instead input into the GEMM operation accelerator, which processes the batch of GEMM operations in parallel in place of the original solving method, so as to accelerate GoogLeNet in applications such as image recognition and image classification.
For a batch of matrices of different sizes input by GoogLeNet for GEMM operations, the accelerator determines whether the number of rows and the number of columns of each matrix are less than or equal to 1024. If both are, the matrix is dynamically sliced: the strategy that is optimal in the current environment is selected from a plurality of predetermined slicing strategies according to the size of each matrix, the GPU architecture in use and its relevant parameters, and the matrix is sliced accordingly. The GEMM operation is then performed on each matrix slice in parallel, and the merged result is returned to GoogLeNet. If the number of rows or columns is greater than 1024, the result is obtained by the traditional method of calling, in a loop, the general matrix multiplication API provided by platforms such as CUDA and ROCm, and is then returned to GoogLeNet.
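The size check that routes each matrix to one of the two paths can be sketched as follows. This is an illustrative outline only: each GEMM is represented by a single hypothetical (rows, columns) pair, and the names `is_small` and `split_batch` are assumptions rather than part of the accelerator.

```python
SIZE_LIMIT = 1024  # threshold stated in the text


def is_small(rows, cols, limit=SIZE_LIMIT):
    """True when both the row count and the column count are within the limit."""
    return rows <= limit and cols <= limit


def split_batch(shapes):
    """Partition (rows, cols) pairs into the dynamically sliced path and the looped-API path."""
    small = [s for s in shapes if is_small(*s)]
    large = [s for s in shapes if not is_small(*s)]
    return small, large


# Hypothetical batch: two matrices within the limit, two exceeding it in one dimension.
small, large = split_batch([(16, 784), (1024, 1024), (2048, 64), (96, 1025)])
```

Matrices in `small` would be dynamically sliced and processed together, while those in `large` would fall back to the platform's general matrix multiplication API called in a loop.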
In a preferred embodiment, the slicing strategy is formulated as shown in table 1:
TABLE 1
T_M | T_N | T_K | Work Items/Work Group
16 | 16 | 8 | 128
32 | 32 | 8 | 128
64 | 64 | 8 | 128
128 | 64 | 8 | 128
64 | 128 | 8 | 128
16 | 16 | 8 | 256
32 | 32 | 8 | 256
64 | 64 | 8 | 256
128 | 64 | 8 | 256
64 | 128 | 8 | 256
In Table 1, T_M is the number of rows of matrix slices A and C, T_N is the number of columns of matrix slices B and C, and T_K is the number of columns of matrix slice A and the number of rows of matrix slice B in the GEMM operation on matrix slices (i.e., C = α × A × B + β × C, where A, B and C are matrix slices and α and β are scalars); Work Items/Work Group is the number of work items in a single work group.
On native CUDA and ROCm platforms, batched operations on matrices of unequal sizes can only be solved by brute-force looping (the traditional method); the alternative is the vbatch method of the next-generation linear algebra (LA) GPU acceleration library MAGMA. Compared with both, the method of the invention uses dynamic slicing and takes thread-level parallelism (TLP) and instruction-level parallelism (ILP) into account at the same time, so that batched GEMM performance improves substantially when the row and column counts of the matrices are less than or equal to 1024.
To avoid idle threads caused by matrices of different sizes taking part in one batched GEMM operation, the pre-designed slicing strategies keep the work group size allocated to each matrix slice consistent; what changes is the size of the sub-slice that each work item in a single work group is responsible for computing.
In one embodiment, 10 slicing strategies are defined for dynamic selection according to the matrix slice size and the number of work items in a single work group (see Table 1). The effect of work group size on sub-slice size is also considered: for example, for a 16 × 16 matrix slice and a work group of 128 work items, each sub-slice contains (16 × 16)/128 = 2 elements, so the sub-slice is set to 2 × 1.
When selecting a slicing strategy, both thread-level parallelism and instruction-level parallelism must be considered. A work group with more work items (for example, 256) improves thread-level parallelism, but instruction-level parallelism drops because the sub-slices become smaller; the two kinds of parallelism must be balanced.
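The trade-off can be made concrete by computing, for every strategy in Table 1, how many output elements each work item is responsible for. The sketch below is illustrative: the strategy tuples are copied from Table 1, while the function and variable names are assumptions.

```python
# (T_M, T_N, T_K, work items per work group), taken from Table 1.
STRATEGIES = [
    (16, 16, 8, 128), (32, 32, 8, 128), (64, 64, 8, 128), (128, 64, 8, 128), (64, 128, 8, 128),
    (16, 16, 8, 256), (32, 32, 8, 256), (64, 64, 8, 256), (128, 64, 8, 256), (64, 128, 8, 256),
]


def elements_per_work_item(t_m, t_n, work_items):
    """Sub-slice size: output elements of a T_M x T_N matrix slice handled by one work item."""
    return (t_m * t_n) // work_items


per_item = {
    (t_m, t_n, wi): elements_per_work_item(t_m, t_n, wi)
    for t_m, t_n, _t_k, wi in STRATEGIES
}
```

For the same matrix slice, doubling the work group from 128 to 256 work items halves the sub-slice (for a 16 × 16 slice, from 2 elements to 1), which is the loss of instruction-level parallelism that the balancing method has to weigh against the gain in thread-level parallelism.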
In a preferred embodiment, dynamic slicing uses a balancing method that takes both thread-level parallelism and instruction-level parallelism into account. The balancing method comprises the following steps:
(1) For each GEMM operation, first calculate the optimal number of work items per work group, $N_{WI}$:
where $N_{Max\_WG}$ is the maximum number of work items that a single work group can contain, and $N_{SIMD}$ is the number of SIMDs (i.e., vector processing units) in a single CU (i.e., compute unit).
$N_{WI}$ is compared with the work-items-per-work-group parameter of each existing slicing strategy, and the closest existing value is selected:
$\min_i \left| N_{WI} - T_{WI_i} \right|$
where $T_{WI_i}$ is the number of work items in a single work group under the $i$-th slicing strategy.
(2) Then screen out the feasible slicing strategies according to the principle that the matrix slice must be smaller than the input matrix. For each strategy that passes the screening, calculate the corresponding number of work groups $N_{WG_i}$:
where $T_{M_i}$ and $T_{N_i}$ are the number of rows and columns of the matrix slice under the $i$-th slicing strategy, and $M_j$ and $N_j$ are the number of rows and columns of matrix C in the $j$-th GEMM.
(3) After the number of work groups for each slicing strategy has been calculated, select as the optimal slicing strategy the one whose work group count is closest to an integer multiple of the CU count:
$\min_i \left( N_{WG_i} \bmod N_{CU} \right)$
where $N_{CU}$ is the total number of CUs.
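The three steps can be outlined in code under explicit assumptions, because the formulas for the optimal work item count and for the work group count are not reproduced in the text. In the sketch below, `n_wi` is therefore taken as an input rather than computed; the work group count is assumed to be the sum over all GEMMs of ceil(M_j / T_M_i) × ceil(N_j / T_N_i); and a strategy is treated as feasible when its slice is no larger than any C matrix in the batch. All function names and the sample batch are hypothetical.

```python
# (T_M, T_N, T_K, work items per work group), taken from Table 1.
STRATEGIES = [
    (16, 16, 8, 128), (32, 32, 8, 128), (64, 64, 8, 128), (128, 64, 8, 128), (64, 128, 8, 128),
    (16, 16, 8, 256), (32, 32, 8, 256), (64, 64, 8, 256), (128, 64, 8, 256), (64, 128, 8, 256),
]


def ceil_div(x, y):
    """Integer ceiling division."""
    return -(-x // y)


def work_group_count(strategy, shapes):
    """Assumed N_WG: one work group per T_M x T_N slice of every C matrix in the batch."""
    t_m, t_n = strategy[0], strategy[1]
    return sum(ceil_div(m, t_m) * ceil_div(n, t_n) for m, n in shapes)


def select_strategy(shapes, n_wi, n_cu, strategies=STRATEGIES):
    # Step 1: keep the strategies whose work group size is closest to n_wi.
    best_gap = min(abs(n_wi - s[3]) for s in strategies)
    candidates = [s for s in strategies if abs(n_wi - s[3]) == best_gap]
    # Step 2: keep strategies whose slice fits inside every C matrix (assumed reading).
    feasible = [s for s in candidates
                if all(s[0] <= m and s[1] <= n for m, n in shapes)]
    if not feasible:
        raise ValueError("no feasible slicing strategy for this batch")
    # Step 3: minimize N_WG mod N_CU, as stated in the text.
    return min(feasible, key=lambda s: work_group_count(s, shapes) % n_cu)


# Hypothetical batch of (M, N) shapes for matrix C; n_wi and n_cu are assumed values.
batch = [(64, 784), (96, 784), (128, 196)]
chosen = select_strategy(batch, n_wi=256, n_cu=60)
```

With these inputs the four feasible 256-work-item strategies yield 594, 153, 47 and 25 work groups, and the remainder criterion picks the 64 × 128 slice. A different reading of the feasibility rule or of the work group formula would change the outcome, so the sketch should be read as one possible interpretation.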
For the GEMM operation, matrix C is partitioned into matrix slices of size X × Y. Each matrix slice is computed from the corresponding row data of matrix A (X rows) and the corresponding column data of matrix B (Y columns). This is too much data for the GPU's VGPRs (vector general-purpose registers) and LDS (local data share) to hold at one time.
First, the corresponding matrix slices of matrix A and matrix B are loaded from system memory into the LDS of the GPU. Within the LDS, the matrix slices of A and B are then divided into sub-slices according to the number of work items per work group in the slicing strategy, and the corresponding sub-slices are loaded into VGPRs, where the GEMM operation is computed. Finally, the sub-slice results are merged to obtain the final result, making full use of thread-level parallelism.
The specific steps are as follows:
(I) Selecting a slicing strategy
As shown in fig. 1, GoogLeNet inputs a batch of matrices of varying sizes for computation. In essence, the invention raises GPU utilization as far as possible in situations where the GPU cannot be fully used, and thereby improves efficiency. Compared with the traditional method and the MAGMA vbatch method, the GEMM operation accelerator of the invention is therefore better suited to small matrices. In practice GoogLeNet often involves matrices whose row and column counts are less than or equal to 1024, so the accelerator can speed up GoogLeNet in scenarios such as image recognition and image classification. The GEMM operation accelerator first determines whether the row and column counts of the batch of matrix pairs input by GoogLeNet are less than or equal to 1024; matrices larger than 1024 are computed with the traditional method, and matrices less than or equal to 1024 are computed with the optimized method.
To select the optimal slicing strategy from the pre-established strategies, the current GPU architecture and relevant parameters must first be obtained, including the maximum number of work items that a single work group can contain, the number of SIMDs in a single CU, and the total number of CUs.
First, calculate the optimal number of work items per work group, $N_{WI}$:
where $N_{Max\_WG}$ is the maximum number of work items that a single work group can contain, and $N_{SIMD}$ is the number of SIMDs in a single CU. $N_{WI}$ is compared with the work-items-per-work-group parameter of each existing slicing strategy, and the closest value is selected:
$\min_i \left| N_{WI} - T_{WI_i} \right|$
where $T_{WI_i}$ is the number of work items in a single work group under the $i$-th slicing strategy.
Then screen out the feasible slicing strategies according to the principle that the matrix slice must be smaller than the input matrix. For each strategy that passes the screening, calculate the corresponding number of work groups $N_{WG_i}$:
where $T_{M_i}$ and $T_{N_i}$ are the number of rows and columns of the matrix slice under the $i$-th slicing strategy, and $M_j$ and $N_j$ are the number of rows and columns of matrix C in the $j$-th GEMM.
After the number of work groups for each slicing strategy has been calculated, select as the optimal slicing strategy the one whose work group count is closest to an integer multiple of the CU count:
$\min_i \left( N_{WG_i} \bmod N_{CU} \right)$
where $N_{CU}$ is the total number of CUs.
(II) Slice computation
The matrix pairs stored in system memory are sliced according to the selected slicing strategy. To improve transfer efficiency, the resulting matrix slices are stored one by one in the LDS (local data share) of each CU (compute unit). The matrix slices of A and B in the LDS are divided into sub-slices according to the number of work items per work group in the slicing strategy, the corresponding sub-slices are transferred one by one into the VGPRs (vector general-purpose registers) of each SIMD (vector processing unit), and the SIMDs then compute the GEMM operation.
(III) Merging the results
After each sub-slice has been computed, its result is stored back to its original address in the LDS; after a whole matrix slice has been computed, its result is stored back to its original address in system memory; and after all matrix slices have been computed, the complete GEMM result is returned to GoogLeNet.
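The slice, compute and merge sequence can be imitated on the CPU to show that slicing does not change the result. The following is a sequential sketch, not GPU code: each (i0, j0) block stands in for one matrix slice handled by a work group, the T_K-wide chunks stand in for the portions of A and B staged through the LDS, and the write-back loop stands in for merging. The function name, slice sizes and test matrices are all illustrative assumptions, and the sketch does not model the LDS or VGPRs themselves.

```python
def tiled_gemm(alpha, a, b, beta, c, t_m, t_n, t_k):
    """Compute alpha * (A x B) + beta * C one T_M x T_N slice of C at a time."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i0 in range(0, m, t_m):
        for j0 in range(0, n, t_n):
            # One matrix slice of C; edge slices may be smaller than T_M x T_N.
            rows = range(i0, min(i0 + t_m, m))
            cols = range(j0, min(j0 + t_n, n))
            acc = {(i, j): 0 for i in rows for j in cols}
            # Walk the shared dimension in T_K-wide chunks of A and B.
            for p0 in range(0, k, t_k):
                chunk = range(p0, min(p0 + t_k, k))
                for i in rows:
                    for j in cols:
                        acc[i, j] += sum(a[i][p] * b[p][j] for p in chunk)
            # Merge the finished slice back into the full result.
            for (i, j), value in acc.items():
                out[i][j] += alpha * value
    return out


# Small integer matrices whose sizes are not multiples of the slice sizes.
a = [[(i + p) % 5 for p in range(7)] for i in range(5)]
b = [[(p * j) % 3 for j in range(6)] for p in range(7)]
c = [[1] * 6 for _ in range(5)]
tiled = tiled_gemm(2, a, b, 3, c, t_m=2, t_n=4, t_k=3)
direct = [[2 * sum(a[i][p] * b[p][j] for p in range(7)) + 3 * c[i][j]
           for j in range(6)] for i in range(5)]
```

Comparing `tiled` with `direct` confirms that the sliced computation reproduces the unsliced GEMM, including at edges where a slice is only partially filled.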
Example 2
An image processing acceleration method based on GoogLeNet comprises the following steps:
After a series of pre-processing steps, the image is input into GoogLeNet, and the data passes through several layers to an Inception structure. The Inception structure contains four 1 × 1 convolution kernels, one 3 × 3 convolution kernel and one 5 × 5 convolution kernel. In a CUDA or ROCm environment, a convolution is computed by converting it into a GEMM (i.e., C = α × A × B + β × C, where A, B and C are matrices and α and β are scalars). For the converted matrices, M (the number of rows of A and C), N (the number of columns of B and C) and K (the number of columns of A and the number of rows of B) are generally less than 1000, and M can even be less than 100. For example, for the convolution in inception_3a/5x5_reduce, the size after conversion into GEMM is M × N × K = 16 × 784 × 192, and performance on the GPU is less than 1% of peak, because the matrix is small and, after slicing, there are not enough work groups to fully occupy the GPU. In a CUDA or ROCm environment, the matrix operations of the Inception structure are by default solved by calling the GEMM subroutine serially; the GEMM operation accelerator of the invention instead handles the matrix operations of the four 1 × 1 convolution kernels together, which raises GPU utilization and ultimately accelerates image processing.
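The conversion of a 1 × 1 convolution into a GEMM can be illustrated directly: the filter weights form A (M × K, one row per output channel), the input feature map is flattened into B (K × N, one row per input channel and one column per pixel), and C = A × B holds the output channels. The sketch below uses plain Python lists; the function names and the tiny test image are assumptions, and the reading of 16 × 784 × 192 as 16 filters applied to a 28 × 28 map with 192 channels is an inference from the sizes rather than something stated in the text.

```python
def gemm_shape(c_out, c_in, height, width):
    """(M, N, K) of the GEMM equivalent to a 1 x 1 convolution."""
    return c_out, height * width, c_in


def conv1x1_as_gemm(image, weights):
    """image: [C_in][H][W]; weights: [C_out][C_in]; returns [C_out][H][W]."""
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    # B is K x N: one row per input channel, one column per pixel.
    b = [[image[ch][y][x] for y in range(h) for x in range(w)] for ch in range(c_in)]
    # C = A x B with A = weights (M x K); alpha = 1 and beta = 0.
    c = [[sum(w_row[p] * b[p][q] for p in range(c_in)) for q in range(h * w)]
         for w_row in weights]
    # Reshape each row of C back into an H x W output channel.
    return [[c_row[y * w:(y + 1) * w] for y in range(h)] for c_row in c]


shape = gemm_shape(16, 192, 28, 28)

# Tiny illustrative input: 2 channels of 2 x 2 pixels, 2 output channels.
image = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]
weights = [[1, 0], [1, 1]]
output = conv1x1_as_gemm(image, weights)
```

Because the four 1 × 1 convolutions of an Inception structure each reduce to an independent GEMM of this kind, they form a natural batch of small matrices for the accelerator.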
The above examples are preferred embodiments of the invention and further describe its objects, technical solutions and advantages in detail. The embodiments of the invention are not limited to these examples; any modification, equivalent substitution or improvement made without departing from the spirit and principle of the invention shall be regarded as an equivalent replacement and falls within the protection scope of the invention.