CN115328439A - Incremental matrix multiplication accelerator applied to HPC/AI - Google Patents

Incremental matrix multiplication accelerator applied to HPC/AI Download PDF

Info

Publication number
CN115328439A
CN115328439A CN202210847517.9A
Authority
CN
China
Prior art keywords
matrix
buffer
core
hpc
accelerator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210847517.9A
Other languages
Chinese (zh)
Inventor
文梅
沈俊忠
汪志
薛泽宇
曹亚松
刘胜
雷元武
陈小文
郭阳
汤珉琎
杨建超
杨韧禹
李宇航
康宇晗
黄浩岚
方亚豪
鞠鑫
冯静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210847517.9A priority Critical patent/CN115328439A/en
Publication of CN115328439A publication Critical patent/CN115328439A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an incremental matrix multiplication accelerator for HPC/AI applications, comprising an HPC core, an MZ accelerator core, a high-bandwidth memory HBM, a global shared memory GSM and a data bus. The MZ accelerator core comprises a systolic array SA, a B buffer and a C buffer; the B buffer caches the matrix B required by the matrix multiplication performed in the systolic array SA, and the C buffer caches the matrix C required by that multiplication. The global shared memory GSM comprises an A buffer for storing the matrix A required by the matrix multiplication in the systolic array SA. The invention saves storage resources without sacrificing performance, satisfies the bandwidth requirement of the systolic array, allows high-performance GEMM tasks to be completed together with the original HPC core, does not change the ecosystem of the original HPC core, and remains compatible with the original matrix function library.

Description

Incremental matrix multiplication accelerator applied to HPC/AI
Technical Field
The invention relates to high-performance computing (HPC) and artificial intelligence (AI) technology, and in particular to an incremental matrix multiplication accelerator for HPC/AI.
Background
Since the advent of deep learning, and combined with the traditional high-performance computing demands of data centers, converging high-performance computing (HPC) and artificial intelligence (AI) computing power has become a trend, both from a cost and from a computing-power perspective. Research has shown that the core of both high-performance computing and artificial intelligence workloads is general matrix multiplication (GEMM); the two differ only in precision. For example, the core operations of the convolutional and fully-connected layers, which dominate the computation of a typical convolutional neural network (CNN) in AI workloads, can be converted into matrix multiplication through im2col or batch operations. The core operation of the Transformer, the mainstream model in natural language processing (NLP), is also matrix multiplication. To improve HPC and AI performance, adding an acceleration core that optimizes GEMM to the HPC chip is one of the mainstream technical approaches. Because their data-flow control is relatively simple, systolic arrays are among the most efficient structures for matrix multiplication and are widely used.
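As a concrete illustration of the im2col lowering mentioned above, the following sketch (not part of the patent; the function name, shapes and single-channel setting are chosen only for the example) converts a small stride-1 convolution into a GEMM:

import numpy as np

def im2col(x, kh, kw):
    # Unfold a single-channel feature map x (H x W) into a matrix whose
    # columns are the kh*kw patches visited by a stride-1 convolution.
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.arange(25, dtype=np.float32).reshape(5, 5)   # 5x5 input feature map
k = np.ones((3, 3), dtype=np.float32)               # 3x3 convolution kernel

B = im2col(x, 3, 3)               # dynamic matrix: unfolded input patches
A = k.ravel()[None, :]            # fixed matrix: kernel flattened to a 1 x 9 row
Y = A @ B                         # the convolution expressed as a GEMM
print(Y.reshape(3, 3))            # equals the 'valid' convolution of x with k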
The structure of a systolic array (SA) is shown in FIG. 1. It consists of an array of r rows and c columns of processing elements (PEs). Each PE is typically a multiply-accumulate (MAC) unit that completes multiply-add operations in a pipelined fashion. The GEMM computed by the systolic array has the form Y = AB + C. In AI computation, the matrices A and B represent the weights and the feature map, respectively, C represents the bias matrix, and Y represents the result. It is therefore usually necessary to provide three buffers A, B and C; the C buffer stores the initial bias matrix C and the intermediate partial sums. A systolic array generally supports three data-flow modes, weight stationary (WS), input stationary (IS) and output stationary (OS), which correspond to preloading the matrix A, B or C, respectively, into the PEs. Only IS and WS systolic arrays are discussed herein. The preloaded matrix is called the fixed matrix, the other operand is the dynamic matrix, and the dynamic matrix is loaded row by row into the adjacent PEs for computation.
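To make this data flow concrete, here is a minimal functional sketch of my own (not the patent's implementation and not cycle-accurate): the fixed matrix is held in the PE grid while rows of the dynamic matrix stream through, and partial sums accumulate down each column, seeded with the bias matrix:

import numpy as np

def systolic_gemm(fixed, dynamic, partial):
    # Functional sketch: the fixed matrix (k x n) is preloaded into a k x n
    # grid of PEs, rows of the dynamic matrix (m x k) stream through, and
    # partial sums accumulate down each column, seeded with partial (m x n).
    m, k = dynamic.shape
    k2, n = fixed.shape
    assert k == k2 and partial.shape == (m, n)
    pe = fixed.copy()                    # each PE(i, j) holds fixed[i, j]
    Y = np.empty_like(partial)
    for row in range(m):                 # one dynamic row per injection wave
        acc = partial[row].copy()        # column accumulators start at C
        for i in range(k):               # MAC results move down each column
            acc = acc + dynamic[row, i] * pe[i, :]
        Y[row] = acc
    return Y

rng = np.random.default_rng(0)
A, B, C = rng.random((4, 3)), rng.random((3, 5)), rng.random((4, 5))
assert np.allclose(systolic_gemm(fixed=B, dynamic=A, partial=C), A @ B + C)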
A typical systolic-array-based processor is the Google Tensor Processing Unit (TPU), whose structure is shown in FIG. 2. It is fabricated in a 28 nm process with a die area of less than 330 mm². The systolic array in the TPU contains 256 × 256 = 65,536 PEs; although the number of PEs is large, the systolic array actually occupies only about one quarter of the chip area, and the compute and control units together account for less than half of the chip. The on-chip storage of the TPU, in turn, occupies roughly one third of the area.
The background of this work is the design of an incremental systolic-array accelerator core for a chip that already has high-performance computing capability. The basic structure of the HPC chip already comprises resources such as the HPC cores, the interconnect bus and the shared memory, so the area budget of the incremental acceleration core is limited. Supporting GEMM (FP16, FP32) for AI applications while also completing FP64 GEMM together with the existing HPC cores, without changing the ecosystem of the original system, is the key technical problem to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the problems in the prior art, the invention provides an incremental matrix multiplication accelerator for HPC/AI that saves storage resources without affecting performance, satisfies the bandwidth requirement of the systolic array, can complete high-performance GEMM tasks together with the original HPC core, does not change the ecosystem of the original HPC core, and is compatible with the original matrix function library.
To solve the above technical problems, the invention adopts the following technical scheme:
An incremental matrix multiplication accelerator for HPC/AI applications comprises an HPC core, an MZ accelerator core, a high-bandwidth memory HBM, a global shared memory GSM and a data bus, wherein the HPC core, the MZ accelerator core, the high-bandwidth memory HBM and the global shared memory GSM are each connected to the data bus. The MZ accelerator core comprises a systolic array SA, a B buffer and a C buffer; the systolic array SA comprises processing elements PE arranged in a grid of multiple rows and columns; the B buffer caches the matrix B required by the matrix multiplication performed in the systolic array SA, and the C buffer caches the matrix C required by that multiplication. The global shared memory GSM comprises an A buffer for storing the matrix A required by the matrix multiplication in the systolic array SA. When the HPC core is not working, the MZ accelerator core uses the A buffer exclusively; when the HPC core and the MZ accelerator core are both working, they share the A buffer over the data bus.
Optionally, when the HPC core and the MZ accelerator core are both working, the computation mode executed by the MZ accelerator core is the matrix multiplication task GEMM within an HPC task; when the HPC core is not working, the computation mode executed by the MZ accelerator core is the convolution task CONV or the fully-connected task FC within an AI task.
Optionally, when the computation mode executed by the MZ accelerator core is the matrix multiplication task GEMM in an HPC task, the computation performed by the MZ accelerator core is Yᵀ = BᵀAᵀ + Cᵀ, where the A buffer caches the matrix A, the B buffer caches the transposed matrix Bᵀ of the matrix B, and the C buffer caches the transposed matrix Cᵀ of the matrix C or the transposed matrix Yᵀ of the resulting matrix Y; when the MZ accelerator core reads matrices from the A buffer, the B buffer and the C buffer, a transposition is applied only to the matrix A read from the A buffer, yielding the transposed matrix Aᵀ.
Optionally, when the computation mode executed by the MZ accelerator core is the matrix multiplication task GEMM in an HPC task, the transposed matrix Aᵀ of the matrix A is preloaded as the fixed matrix into the processing elements PE of the systolic array SA, and the transposed matrices Bᵀ and Cᵀ are stored in the on-chip buffers of the systolic array SA and participate in the computation as dynamic matrices.
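The mapping above relies only on the identity (AB + C)ᵀ = BᵀAᵀ + Cᵀ. A short numpy check (illustrative only; the buffer assignments in the comments follow the description above) confirms that computing the transposed form gives the same result as Y = AB + C:

import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))    # shared in GSM, transposed on the fly during injection
B = rng.random((4, 8))    # stored transposed (B^T) in the private B buffer
C = rng.random((6, 8))    # stored transposed (C^T) in the private C buffer

Y_direct = A @ B + C                      # the original mapping Y = AB + C
Y_transposed = (B.T @ A.T + C.T).T        # the MZ mapping Y^T = B^T A^T + C^T
assert np.allclose(Y_direct, Y_transposed)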
Optionally, when the computation mode executed by the MZ accelerator core is the convolution task CONV in an AI task, the computation performed by the MZ accelerator core is Y = AB + C, where the A buffer caches the matrix A, the B buffer caches the matrix B, and the C buffer caches the matrix C or the resulting matrix Y; no transposition is applied when the MZ accelerator core reads matrices from the A buffer, the B buffer or the C buffer.
Optionally, when the computation mode executed by the MZ accelerator core is the convolution task CONV in an AI task, the matrix A is preloaded as the fixed matrix into the processing elements PE of the systolic array SA, and the matrices B and C are stored in the on-chip buffers of the systolic array SA and participate in the computation as dynamic matrices.
Optionally, when the computation mode executed by the MZ accelerator core is the fully-connected task FC in an AI task, the computation performed by the MZ accelerator core is Y = ABᵀ + C, where the A buffer caches the matrix A, the B buffer caches the matrix B or its transposed matrix Bᵀ, and the C buffer caches the matrix C or the resulting matrix Y; no transposition is applied when the MZ accelerator core reads matrices from the A buffer and the C buffer, and a transposition is applied or not applied when reading from the B buffer, depending on whether B or Bᵀ is stored.
Optionally, when the computation mode executed by the MZ accelerator core is the fully-connected task FC in an AI task, the matrix B or its transposed matrix Bᵀ is preloaded as the fixed matrix into the processing elements PE of the systolic array SA, and the matrices A and C are stored in the on-chip buffers of the systolic array SA and participate in the computation as dynamic matrices.
Optionally, when the matrix B or its transposed matrix Bᵀ is preloaded as the fixed matrix into the processing elements PE of the systolic array SA: if the matrix B is preloaded as the fixed matrix, the loading process comprises automatically blocking the fixed matrix according to a set size, padding any undersized block at the end and setting the corresponding mask bits to 0 so that the padded data do not participate in the computation, and then injecting the blocks, in block order, into the systolic array SA from its right side, which realizes an invisible transposition and completes the loading of the fixed matrix; if the transposed matrix Bᵀ is preloaded as the fixed matrix, the loading process comprises automatically blocking the fixed matrix according to the set size, padding any undersized block at the end and setting the corresponding mask bits to 0 so that the padded data do not participate in the computation, and then injecting the blocks, in block order, into the systolic array SA from its upper side to complete the loading of the fixed matrix.
Optionally, the incremental matrix multiplication accelerator further comprises a configuration bus, and the HPC core and the MZ accelerator core are each connected to the configuration bus.
Compared with the prior art, the invention mainly has the following advantages:
the incremental matrix multiplication accelerator comprises an HPC core, a stepping accelerator core MZ, a high-bandwidth storage HBM, a global shared storage GSM and a data bus, wherein the HPC core, the stepping accelerator core MZ, the high-bandwidth storage HBM and the global shared storage GSM are respectively connected with the data bus, the stepping accelerator core MZ comprises a pulsation array SA, a B buffer and a C buffer, the pulsation array SA comprises a plurality of rows and a plurality of columns of processing units PE arranged in a grid shape, the B buffer is used for buffering a matrix B required by matrix multiplication in an input pulsation array SA, the C buffer is used for buffering a matrix C required by matrix multiplication in the input pulsation array SA, the global shared storage GSM comprises an A buffer used for storing the matrix A required by matrix multiplication in the pulsation array SA, the stepping accelerator core independently uses the matrix A required by matrix multiplication in the pulsation array SA stored in the A buffer when the HPC core is not in operation, and the stepping accelerator core MZ shares the matrix A required by the matrix A in the array in the A buffer when the HPC core and the stepping accelerator core are both in operation. The invention can integrate HPC and AI acceleration, is embedded into a system environment of an HPC core, shares High Bandwidth Memory (HBM) and Global Shared Memory (GSM) with the HPC core, and is matched with the HPC core to complete matrix multiplication GEMM acceleration, and meanwhile, the MZ core (stepping accelerator core) can also complete acceleration of AI tasks. By utilizing the insensitivity of the systolic array to the fixed matrix bandwidth, a half storage structure is realized in a meizoic accelerator core, and different from a structure facing the systolic array, three matrix buffer storages A, B, C/Y matrix data are needed. GSM on HPC core bus and B buffer and C buffer on stepping accelerator core sheet are used to form half storage structure, which can effectively reduce area overhead of on-sheet storage and increase density of calculation unit of stepping accelerator core.
Drawings
Fig. 1 is a diagram showing a configuration of a systolic array SA in the related art.
Fig. 2 is a diagram of a TPU structure in the prior art.
Fig. 3 illustrates the weight-stationary and input-stationary data-flow modes involved in an embodiment of the present invention.
Fig. 4 is an example of the principle of computation time of a systolic array in an embodiment of the invention.
FIG. 5 is a graph illustrating the effect of dynamic matrix bandwidth on performance according to an embodiment of the present invention.
FIG. 6 is a graph illustrating the effect of fixed matrix bandwidth on performance according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an incremental matrix multiplier accelerator according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of injecting the fixed matrix from the top in an embodiment of the present invention.
FIG. 9 is a schematic diagram of injecting the fixed matrix from the right side in an embodiment of the present invention.
FIG. 10 is a flowchart illustrating the operation of the MZ accelerator core according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating automatic blocking of a fixed matrix according to an embodiment of the present invention.
FIG. 12 is a diagram illustrating automatic partitioning of a dynamic matrix according to an embodiment of the present invention.
Detailed Description
Before describing the incremental matrix multiplication accelerator for HPC/AI applications of the present invention, the memory-access bandwidth requirements of a systolic array executing GEMM are first analyzed. Fig. 3 shows the weight-stationary (WS) and input-stationary (IS) data flows of a systolic array. Whether the data-flow mode is input stationary (IS) or weight stationary (WS), three memory banks are required to provide data for the GEMM computation, and the results of the systolic array are written back into the bias-matrix / partial-sum buffer. The matrices must be blocked and injected into the systolic array according to its size, i.e., the number of processing elements PE. The fixed matrix is loaded only once per computation, while the other two matrices are dynamic matrices and must be injected into the systolic array concurrently. The bandwidth of each memory bank must match the size of the systolic array; in particular, the dynamic matrix must supply data to the systolic array every cycle in order to keep resource utilization high. A modeling analysis is then carried out: the systolic array has r rows and c columns, and each processing element PE needs x beats to compute its result in IS mode. Under the input-stationary (IS) data flow, the computation Y = AB + C maps to Y = WI + Bias, where W is the weight, I is the input, and Bias is the bias / partial sum. The matrix A has m rows and k columns, the matrix B has k rows and n columns, and the matrix C has m rows and n columns; α is the bandwidth utilization of matrix A, β is the bandwidth utilization of matrix B, and the matrices C and A must be injected into the systolic array simultaneously. Fig. 4 illustrates the computation time of Y = AB + C, from which the computation time as a function of the bandwidth utilizations is obtained, as shown in equation (1). Equation (2) expresses the ratio between the optimum performance and the tested performance at the same computation scale.
[Equation (1): the computation time t(α, β) — provided only as an image in the original publication.]
[Equation (2): the ratio pref_ between the optimum and tested performance — provided only as an image in the original publication.]
In the above formulas, t(α, β) is the computation time, r is the number of rows of the systolic array, β is the bandwidth utilization of matrix B, k is the number of rows of matrix B, n is the number of columns of matrix B, x is the number of beats, c is the number of columns of the systolic array, m is the number of rows of matrix A, α is the bandwidth utilization of matrix A, pref_ is the ratio between the optimum and tested performance, and t(1.0) is the computation time when α and β are both 1.0. Based on the theoretical models (1) and (2), the influence of different bandwidths and bandwidth utilizations on performance is analyzed. In the MZ accelerator core the compute core is a systolic array of 16 rows and 16 columns of processing elements PE, so the variables are tied to reality when analyzing performance: r = c = 16, x = 1, m = k = n = 16i (i = 1, 2, 3, …, 10), and α and β each vary from 10% to 100%. Figs. 5 and 6 show the performance variation for different matrix sizes (different curves) and different bandwidth utilizations (abscissa). In Fig. 5, as the size of matrix A increases, the impact of the matrix A bandwidth on performance grows and eventually becomes proportional. In Fig. 6, as the size of matrix B increases, the impact of the matrix B bandwidth on performance decreases and eventually becomes almost negligible. Thus the GEMM performance of a systolic array is extremely sensitive to the bandwidth of the dynamic matrix, but not to that of the fixed matrix.
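Because equations (1) and (2) are only available as images, the following sketch uses a simplified stand-in model of my own (per fixed-matrix block: a block load time limited by β, a dynamic-matrix streaming time limited by α, and a pipeline fill/drain term). It is not the patent's exact model, but it reproduces the qualitative trend of Figs. 5 and 6 — at large sizes only the dynamic-matrix bandwidth matters:

import numpy as np

def t_model(m, k, n, r=16, c=16, x=1, alpha=1.0, beta=1.0):
    # Toy stand-in for equation (1): per fixed-matrix block, pay the block
    # load time (limited by beta) plus the dynamic-matrix streaming time
    # (limited by alpha) plus a pipeline fill/drain term.
    blocks = np.ceil(k / r) * np.ceil(n / c)
    return blocks * (r / beta + m / alpha + r + c + x)

for size in (16, 64, 256, 1024):
    m = k = n = size
    slow_alpha = t_model(m, k, n, alpha=0.1) / t_model(m, k, n)
    slow_beta = t_model(m, k, n, beta=0.1) / t_model(m, k, n)
    print(f"{size:5d}: 10% dynamic bw -> {slow_alpha:5.2f}x slower, "
          f"10% fixed bw -> {slow_beta:5.2f}x slower")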
As shown in Fig. 7, the incremental matrix multiplication accelerator for HPC/AI of this embodiment comprises an HPC core, an MZ accelerator core, a high-bandwidth memory HBM, a global shared memory GSM and a data bus; the HPC core, the MZ accelerator core, the high-bandwidth memory HBM and the global shared memory GSM are each connected to the data bus. The MZ accelerator core comprises a systolic array SA, a B buffer and a C buffer; the systolic array SA comprises processing elements PE arranged in a grid of multiple rows and columns; the B buffer caches the matrix B required by the matrix multiplication performed in the systolic array SA, and the C buffer caches the matrix C required by that multiplication; the global shared memory GSM comprises an A buffer for storing the matrix A required by the matrix multiplication in the systolic array SA. When the HPC core is not working, the MZ accelerator core uses the A buffer exclusively; when the HPC core and the MZ accelerator core are both working, they share the A buffer. The way memory is used by GEMM in HPC tasks and in AI tasks differs. For GEMM in an AI task, the MZ accelerator core completes the computation on its own and there is no memory-access conflict. For GEMM in an HPC task, the HPC core and the MZ accelerator core share the matrix A data in GSM, so there are access conflicts on the matrix A data in GSM.
In this embodiment, when the HPC core and the MZ accelerator core are both working, the computation mode executed by the MZ accelerator core is the matrix multiplication task GEMM within an HPC task; when the HPC core is not working, the computation mode executed by the MZ accelerator core is the convolution task CONV or the fully-connected task FC within an AI task, as shown in Table 1.
Table 1: computation modes of the MZ accelerator core.
Mode   Task type                 Mapping            Fixed matrix   Dynamic matrices
GEMM   HPC (with the HPC core)   Yᵀ = BᵀAᵀ + Cᵀ     Aᵀ             Bᵀ, Cᵀ
CONV   AI (MZ core alone)        Y = AB + C         A              B, C
FC     AI (MZ core alone)        Y = ABᵀ + C        B or Bᵀ        A, C
(The original table is provided only as an image; this reconstruction follows the description below.)
Referring to Table 1, when the computation mode executed by the MZ accelerator core is the matrix multiplication task GEMM in an HPC task, the computation performed is Yᵀ = BᵀAᵀ + Cᵀ, where the A buffer caches the matrix A, the B buffer caches the transposed matrix Bᵀ, and the C buffer caches the transposed matrix Cᵀ or the transposed matrix Yᵀ of the resulting matrix Y; when the MZ accelerator core reads matrices from the A buffer, the B buffer and the C buffer, a transposition is applied only to the matrix A read from the A buffer, yielding Aᵀ. In this embodiment, when the computation mode is the GEMM task of an HPC task, the transposed matrix Aᵀ is preloaded as the fixed matrix into the processing elements PE of the systolic array SA, and the transposed matrices Bᵀ and Cᵀ are stored in the on-chip buffers of the systolic array SA and participate in the computation as dynamic matrices. The way GEMM uses memory differs between HPC tasks and AI tasks. For GEMM in an AI task, the MZ accelerator core completes the computation on its own and there is no access conflict. For GEMM in an HPC task, the HPC core shares the matrix A data in GSM with the MZ accelerator core, so there are access conflicts on the matrix A data in GSM. To compensate for the latency of accessing the global shared memory GSM over the bus, the fixed-matrix data can be placed in GSM, because the systolic array is not sensitive to the bandwidth of the fixed matrix and that bandwidth is sufficient to sustain the throughput of the systolic array. To solve the two problems of access conflicts and access latency when the fixed matrix is stored in GSM, a dual-mode loading scheme for the fixed matrix is proposed, which realizes an invisible transposition of the matrix. By the rules of matrix computation, Y = AB + C can be converted into Yᵀ = BᵀAᵀ + Cᵀ, and Aᵀ is used as the fixed matrix of the systolic array in the MZ accelerator core, which reduces the bandwidth requirement without affecting the ecosystem of the overall system. Here Yᵀ, Bᵀ, Aᵀ and Cᵀ denote the transposed matrices of Y, B, A and C, respectively. Comparing the two formulas shows that the positions of the matrices A and B are exchanged in the transposed form, so the positions at which they are injected into the systolic array change accordingly. If the original formula were computed in the systolic array with the IS data-flow format, the matrix B would be the fixed matrix and the matrix A the dynamic matrix; the bandwidth requirement of matrix A would then be high, yet matrix A also resides in the global shared memory GSM shared with the HPC core, so the access conflicts would be severe.
After transposition, Aᵀ becomes the fixed matrix: its reuse rate is high and its bandwidth requirement is low, so fetching A from the global shared memory GSM reduces access conflicts. Because the global shared memory GSM is shared by the HPC core and the MZ accelerator core and the ecosystem of the original system must not be disturbed, the transposed matrix cannot be stored in GSM; the transposition must instead be realized during transfer, after the original matrix has been fetched. Analyzing the structure of the systolic array, as shown in Fig. 8, the first row of the systolic array in the MZ accelerator core is located at the bottom: the fixed matrix is injected from the top and shifted downwards, and the first injected row ends up at the bottom of the systolic array, so the matrix is vertically inverted as a whole while its left-right order is unchanged. After the fixed matrix has been injected, the dynamic matrix is injected from the left side and the computation proceeds. If instead the fixed matrix is injected into the systolic array from the right side, column by column, as shown in Fig. 9, the fixed matrix held in the systolic array is the transpose of what would be obtained by injecting from the top. In this way, by choosing the injection direction, the systolic array realizes an implicit transposition while the data remain stored in their original format, avoiding any change to the storage layout of the original system's global shared memory GSM. When executing GEMM in an HPC task, the computation scheme is Yᵀ = BᵀAᵀ + Cᵀ: the MZ accelerator core uses the implicit transposition to inject Aᵀ into the systolic array as the fixed matrix, exploits the insensitivity of the systolic array to the fixed-matrix bandwidth to reduce access conflicts, and thereby allows the HPC core and the MZ accelerator core to complete the HPC task together. The HPC core and the MZ accelerator core share the matrix A data in the global shared memory GSM, the Bᵀ and Cᵀ matrices are stored in transposed form in the private on-chip B and C buffers, and the resulting matrix Y can be stored in the high-bandwidth memory HBM or in the global shared memory GSM.
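The following toy sketch (my own illustration; the exact row/column orientation of Figs. 8 and 9 may differ from this simplification) shows the idea behind the implicit transposition: the same row-major data, pushed in as columns from one side instead of as rows from the other, ends up transposed in the grid:

import numpy as np

def inject_rows_from_top(M):
    # Each stored row enters the top of the grid and pushes earlier rows down.
    grid = np.zeros_like(M)
    for row in M:
        grid[1:, :] = grid[:-1, :].copy()
        grid[0, :] = row
    return grid            # holds M with its rows in reversed vertical order

def inject_rows_from_right(M):
    # Each stored row enters the rightmost column and pushes earlier data left,
    # so the matrix is laid down column-wise: an "invisible" transpose.
    grid = np.zeros_like(M)
    for row in M:
        grid[:, :-1] = grid[:, 1:].copy()
        grid[:, -1] = row
    return grid

M = np.arange(9).reshape(3, 3)
assert np.array_equal(inject_rows_from_top(M), M[::-1, :])   # vertically mirrored
assert np.array_equal(inject_rows_from_right(M), M.T)        # transposed in place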
When executing AI tasks, the MZ accelerator core does not need to cooperate with the HPC core, i.e., no data are shared.
Referring to Table 1, when the computation mode executed by the MZ accelerator core is the convolution task CONV in an AI task, the computation performed is Y = AB + C, where the A buffer caches the matrix A, the B buffer caches the matrix B, and the C buffer caches the matrix C or the resulting matrix Y; no transposition is applied when the MZ accelerator core reads matrices from the A buffer, the B buffer or the C buffer. In this mode the matrix A is preloaded as the fixed matrix into the processing elements PE of the systolic array SA, and the matrices B and C are stored in the on-chip buffers of the systolic array SA and participate in the computation as dynamic matrices. When a convolution task is executed, the multi-channel feature map is converted into the matrix B using the im2col algorithm, which lowers the convolution to a matrix multiplication GEMM. The matrix A holds the convolution kernels and is stored in GSM; since the HPC core does not execute convolutions, there is no access conflict. The input feature map is stored in the private on-chip B buffer, the bias is stored in the private on-chip C buffer, and the resulting matrix Y can be stored in the high-bandwidth memory HBM or in the global shared memory GSM.
Referring to Table 1, when the computation mode executed by the MZ accelerator core is the fully-connected task FC in an AI task, the computation performed is Y = ABᵀ + C, where the A buffer caches the matrix A, the B buffer caches the matrix B or its transposed matrix Bᵀ, and the C buffer caches the matrix C or the resulting matrix Y; no transposition is applied when the MZ accelerator core reads matrices from the A buffer and the C buffer, and a transposition is applied or not applied when reading from the B buffer. In this mode the matrix B or its transposed matrix Bᵀ is preloaded as the fixed matrix into the processing elements PE of the systolic array SA, and the matrices A and C are stored in the on-chip buffers of the systolic array SA and participate in the computation as dynamic matrices. When the fully-connected task FC is executed, the input vectors are spliced into a matrix by a batch operation, converting the matrix-vector products into a matrix multiplication GEMM with the computation scheme Y = ABᵀ + C. The spliced input matrix is stored in the on-chip B buffer; since the B buffer is private to the chip, the Bᵀ matrix can also be stored directly in the on-chip buffer, and the transposition can alternatively be realized through the invisible transposition when the data are injected. The matrix A is stored in GSM; since the HPC core does not execute fully-connected operations, there is no access conflict. The bias is stored in the private on-chip C buffer, and the resulting matrix Y can be stored in the high-bandwidth memory HBM or in the global shared memory GSM.
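As a small illustration of the batch operation described above (shapes and variable names are my own, not from the patent), stacking the input vectors into a matrix turns the per-vector products into a single Y = ABᵀ + C:

import numpy as np

rng = np.random.default_rng(2)
A = rng.random((10, 64))                       # fully-connected weights, kept in GSM
bias = rng.random(10)
vectors = [rng.random(64) for _ in range(8)]   # a batch of input vectors

B = np.stack(vectors)             # batch op: splice vectors into a matrix (8 x 64)
C = np.tile(bias, (8, 1)).T       # broadcast bias into the C matrix (10 x 8)
Y = A @ B.T + C                   # the FC mapping Y = A B^T + C

for i, v in enumerate(vectors):   # matches the per-vector matrix-vector products
    assert np.allclose(Y[:, i], A @ v + bias)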
In this embodiment, when the matrix B or its transposed matrix Bᵀ is preloaded as the fixed matrix into the processing elements PE of the systolic array SA: if the matrix B is preloaded as the fixed matrix, the loading process comprises automatically blocking the fixed matrix according to a set size, padding any undersized block at the end and setting the corresponding mask bits to 0 so that the padded data do not participate in the computation, and then injecting the blocks, in block order, into the systolic array SA from its right side, which realizes the invisible transposition and completes the loading of the fixed matrix; if the transposed matrix Bᵀ is preloaded as the fixed matrix, the loading process comprises automatically blocking the fixed matrix according to the set size, padding any undersized block at the end and setting the corresponding mask bits to 0 so that the padded data do not participate in the computation, and then injecting the blocks, in block order, into the systolic array SA from its upper side to complete the loading of the fixed matrix.
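A minimal sketch of the automatic blocking with zero-padding and mask bits (an illustration under the 16 × 16 block size used in this embodiment; the function and names are mine, not the patent's):

import numpy as np

def block_with_mask(M, tile=16):
    # Cut M into tile x tile blocks in row-major block order; short edge blocks
    # are zero-padded, and a 0 in the mask marks padded (inactive) elements.
    rows, cols = M.shape
    pr = -rows % tile                       # padding needed on each edge
    pc = -cols % tile
    padded = np.pad(M, ((0, pr), (0, pc)))
    mask = np.pad(np.ones_like(M, dtype=np.uint8), ((0, pr), (0, pc)))
    blocks = []
    for i in range(0, padded.shape[0], tile):
        for j in range(0, padded.shape[1], tile):
            blocks.append((padded[i:i + tile, j:j + tile],
                           mask[i:i + tile, j:j + tile]))
    return blocks

M = np.arange(20 * 37, dtype=np.float32).reshape(20, 37)
blocks = block_with_mask(M)
print(len(blocks), blocks[-1][0].shape)     # 6 blocks of 16 x 16 for a 20 x 37 matrix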
As shown in Fig. 7, the incremental matrix multiplication accelerator of this embodiment further comprises a configuration bus, and the HPC core and the MZ accelerator core are each connected to the configuration bus so that their parameters can be configured.
In this embodiment, injecting the matrix data into the systolic array from different directions is referred to as dual-mode loading. Combined with the half-storage structure, this allows the MZ accelerator core to complete HPC tasks together with the HPC core while reducing the on-chip storage area; because the MZ accelerator core shares GSM, one on-chip storage unit is saved and the density of the on-chip compute units of the MZ accelerator core is increased.
As shown in Fig. 10, the work flow of the MZ accelerator core in this embodiment comprises the following steps (a toy dispatch sketch follows step S8):
s1, judging a calculation mode, and if the calculation mode is a matrix multiplication calculation task GEMM, entering a step S2; if the calculation mode is a convolution operation calculation task CONV, the step S3 is executed; if the calculation mode is the full connection calculation task FC, the process may proceed to step S4.
S2, the computation mapping scheme is Yᵀ = BᵀAᵀ + Cᵀ. Since the global shared memory GSM is shared with the HPC core, the matrix A is stored directly in the global shared memory GSM as the fixed matrix, and the matrices B and C are stored in transposed form, i.e., as Bᵀ and Cᵀ, in the on-chip buffers as dynamic matrices. Proceed to step S5.
S3, the computation mapping scheme is Y = AB + C. The MZ accelerator core completes the convolution on its own; the matrix A is stored in GSM as the fixed matrix, and the matrices B and C are stored in the on-chip buffers as dynamic matrices. Proceed to step S6.
S4, the computation mapping scheme is Y = ABᵀ + C. Since the MZ accelerator core completes the fully-connected operation on its own, the matrix B can either be stored directly in the global shared memory GSM as the fixed matrix, or be transposed into Bᵀ and stored in the global shared memory GSM as the fixed matrix; the matrices A and C are stored in the on-chip buffers as dynamic matrices. If the matrix B is stored directly in the global shared memory GSM, proceed to step S5; if Bᵀ is stored in the global shared memory GSM, proceed to step S6.
S5, the matrix has to be transposed, but the invisible transposition must be realized while the data remain stored in their original format. The fixed matrix is automatically blocked as in Fig. 11 with a block size of 16 × 16; if the end block is smaller than 16, it is padded and the MASK bits (which determine whether the corresponding elements take part in the computation) are set to 0, so that the padded data do not participate in the computation. In block order, the blocks are injected into the systolic array from the right side, as shown in Fig. 9, realizing the invisible transposition, and the array then waits for the dynamic matrix to be injected. Proceed to step S7.
S6, the matrix does not need to be transposed, so it is loaded directly in its stored format. The fixed matrix is automatically blocked as in Fig. 11 with a block size of 16 × 16; if the end block is smaller than 16, it is padded and the MASK bits are set to 0, so that the padded data do not participate in the computation. In block order, the blocks are injected into the systolic array from the top, as shown in Fig. 8, and the array then waits for the dynamic matrix to be injected. Proceed to step S7.
S7, whether or not the dynamic matrix needs to be transposed, its format conversion has already been completed when it was stored on chip, so no additional processing is required. The dynamic matrix is automatically blocked as in Fig. 12 with a block size of 16 × m; if the matrix is shorter than 16 in the row direction, it is padded and the MASK bits (which determine whether the corresponding elements take part in the computation) are set to 0, so that the padded data do not participate in the computation. In block order, the blocks are injected into the systolic array from the right side and the results are computed. Proceed to step S8.
S8, if the matrix computation is not finished, the partial sums, with C / Cᵀ added, are stored back into the on-chip buffer as the dynamic matrix and wait for the next injection. If the matrix computation is finished, the resulting matrix Y / Yᵀ is stored into the high-bandwidth memory HBM or the global shared memory GSM.
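To summarize the mode selection of steps S1-S4, here is a toy dispatch function (a sketch only; it evaluates the three mappings functionally and simply records which operand plays the fixed matrix, rather than modeling the hardware):

import numpy as np

def mz_dispatch(mode, A, B, C):
    # Toy dispatch mirroring steps S1-S4: pick the mapping and note which
    # operand is the fixed matrix (held in GSM).
    if mode == "GEMM":                       # HPC task, shared with the HPC core
        return {"fixed": "A (implicitly transposed on injection)",
                "Y": (B.T @ A.T + C.T).T}    # Y^T = B^T A^T + C^T
    if mode == "CONV":                       # AI tasks: the MZ core works alone
        return {"fixed": "A", "Y": A @ B + C}
    if mode == "FC":
        return {"fixed": "B", "Y": A @ B.T + C}
    raise ValueError(mode)

rng = np.random.default_rng(3)
A, B, C = rng.random((16, 8)), rng.random((8, 16)), rng.random((16, 16))
assert np.allclose(mz_dispatch("GEMM", A, B, C)["Y"], A @ B + C)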
To verify the half-storage structure and the dual-mode loading of the fixed matrix, area synthesis was performed on the MZ accelerator core of this embodiment to obtain the overhead of the on-chip compute and storage resources, and the reduction of resource overhead achieved by the method, without affecting performance, was obtained by comparing these overheads with those of the TPU. For the evaluation, the MZ accelerator core was embedded via the bus into the system environment of MT-3000. The MZ accelerator core and MT-3000 share the HBM and GSM; on the premise that the original system environment and performance are unaffected, the area of the MZ accelerator core was synthesized, the concrete results were turned into area percentages, and these were compared with the TPU, as shown in Table 2.
Table 2: comparison of area overhead.
Resource                    MZ accelerator core   TPU
On-chip storage             32.00%                46.03%
Compute (systolic array)    58.03%                47.63%
(The original table is provided only as an image; this reconstruction follows the figures quoted below.)
As can be seen from Table 2, because the MZ accelerator core shares the global shared memory GSM, the on-chip storage dedicated to the systolic array is reduced: on-chip storage occupies only 32.00% of the entire MZ accelerator core, an improvement over the 46.03% occupied by on-chip storage in the TPU. Correspondingly, because the storage overhead is reduced, the density of compute resources on the chip is higher: the systolic array SA, as the compute unit, occupies 58.03% of the MZ accelerator core area, a higher density than the 47.63% occupied by compute resources on the TPU chip, enabling it to perform computation tasks more efficiently.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiments, and all technical solutions that belong to the idea of the present invention belong to the scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. An incremental matrix multiplication accelerator for HPC/AI applications, characterized by comprising an HPC core, an MZ accelerator core, a high-bandwidth memory HBM, a global shared memory GSM and a data bus, wherein the HPC core, the MZ accelerator core, the high-bandwidth memory HBM and the global shared memory GSM are each connected to the data bus; the MZ accelerator core comprises a systolic array SA, a B buffer and a C buffer; the systolic array SA comprises processing elements PE arranged in a grid of multiple rows and columns; the B buffer caches the matrix B required by the matrix multiplication performed in the systolic array SA, and the C buffer caches the matrix C required by that multiplication; the global shared memory GSM comprises an A buffer for storing the matrix A required by the matrix multiplication in the systolic array SA; when the HPC core is not working, the MZ accelerator core uses the A buffer exclusively, and when the HPC core and the MZ accelerator core are both working, they share the A buffer.
2. The incremental matrix multiplication accelerator for HPC/AI applications of claim 1, wherein the computation mode executed by the MZ accelerator core when both the HPC core and the MZ accelerator core are working is the matrix multiplication task GEMM within an HPC task, and the computation mode executed by the MZ accelerator core when the HPC core is not working is the convolution task CONV or the fully-connected task FC within an AI task.
3. The incremental matrix multiplication accelerator for HPC/AI applications of claim 2, wherein when the computation mode executed by the MZ accelerator core is the matrix multiplication task GEMM in an HPC task, the MZ accelerator core performs the computation Yᵀ = BᵀAᵀ + Cᵀ, wherein the A buffer caches the matrix A, the B buffer caches the transposed matrix Bᵀ of the matrix B, and the C buffer caches the transposed matrix Cᵀ of the matrix C or the transposed matrix Yᵀ of the resulting matrix Y; and when the MZ accelerator core reads matrices from the A buffer, the B buffer and the C buffer, a transposition is applied to the matrix A read from the A buffer to obtain the transposed matrix Aᵀ.
4. The incremental matrix multiplication accelerator for HPC/AI applications of claim 3, wherein when the computation mode executed by the MZ accelerator core is the matrix multiplication task GEMM in an HPC task, the transposed matrix Aᵀ of the matrix A is preloaded as the fixed matrix into the processing elements PE of the systolic array SA, and the transposed matrix Bᵀ of the matrix B and the transposed matrix Cᵀ of the matrix C are stored in the on-chip buffers of the systolic array SA and participate in the computation as dynamic matrices.
5. The incremental matrix multiplication accelerator for HPC/AI applications of claim 2, wherein when the computation mode executed by the MZ accelerator core is the convolution task CONV in an AI task, the MZ accelerator core performs the computation Y = AB + C, wherein the A buffer caches the matrix A, the B buffer caches the matrix B, and the C buffer caches the matrix C or the resulting matrix Y; and no transposition is applied when the MZ accelerator core reads matrices from the A buffer, the B buffer and the C buffer.
6. The incremental matrix multiplication accelerator for HPC/AI applications of claim 5, wherein when the computation mode executed by the MZ accelerator core is the convolution task CONV in an AI task, the matrix A is preloaded as the fixed matrix into the processing elements PE of the systolic array SA, and the matrices B and C are stored in the on-chip buffers of the systolic array SA and participate in the computation as dynamic matrices.
7. The incremental matrix multiplication accelerator for HPC/AI applications of claim 2, wherein when the computation mode executed by the MZ accelerator core is the fully-connected task FC in an AI task, the MZ accelerator core performs the computation Y = ABᵀ + C, wherein the A buffer caches the matrix A, the B buffer caches the matrix B or the transposed matrix Bᵀ of the matrix B, and the C buffer caches the matrix C or the resulting matrix Y; no transposition is applied when the MZ accelerator core reads matrices from the A buffer and the C buffer, and a transposition is applied or not applied when reading from the B buffer.
8. The incremental matrix multiplication accelerator for HPC/AI applications of claim 7, wherein when the computation mode executed by the MZ accelerator core is the fully-connected task FC in an AI task, the matrix B or the transposed matrix Bᵀ of the matrix B is preloaded as the fixed matrix into the processing elements PE of the systolic array SA, and the matrices A and C are stored in the on-chip buffers of the systolic array SA and participate in the computation as dynamic matrices.
9. The incremental matrix multiplication accelerator for HPC/AI applications of claim 8, wherein when the matrix B or the transposed matrix Bᵀ of the matrix B is preloaded as the fixed matrix into the processing elements PE of the systolic array SA: if the matrix B is preloaded as the fixed matrix, the process of loading the fixed matrix comprises automatically blocking the fixed matrix according to a set size, padding any undersized block at the end and setting the corresponding mask bits to 0 so that the padded data do not participate in the computation, and then injecting the blocks, in block order, into the systolic array SA from its right side to realize an invisible transposition and complete the loading of the fixed matrix; and if the transposed matrix Bᵀ is preloaded as the fixed matrix, the process of loading the fixed matrix comprises automatically blocking the fixed matrix according to the set size, padding any undersized block at the end and setting the corresponding mask bits to 0 so that the padded data do not participate in the computation, and then injecting the blocks, in block order, into the systolic array SA from its upper side to complete the loading of the fixed matrix.
10. The incremental matrix multiplication accelerator for HPC/AI applications of claim 9, further comprising a configuration bus, wherein the HPC core and the MZ accelerator core are each connected to the configuration bus.
CN202210847517.9A 2022-07-19 2022-07-19 Incremental matrix multiplication accelerator applied to HPC/AI Pending CN115328439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210847517.9A CN115328439A (en) 2022-07-19 2022-07-19 Incremental matrix multiplication accelerator applied to HPC/AI

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210847517.9A CN115328439A (en) 2022-07-19 2022-07-19 Incremental matrix multiplication accelerator applied to HPC/AI

Publications (1)

Publication Number Publication Date
CN115328439A true CN115328439A (en) 2022-11-11

Family

ID=83917883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210847517.9A Pending CN115328439A (en) 2022-07-19 2022-07-19 Incremental matrix multiplication accelerator applied to HPC/AI

Country Status (1)

Country Link
CN (1) CN115328439A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115859011A (en) * 2022-11-18 2023-03-28 上海天数智芯半导体有限公司 Matrix operation method, device and unit, and electronic equipment
CN115859011B (en) * 2022-11-18 2024-03-15 上海天数智芯半导体有限公司 Matrix operation method, device, unit and electronic equipment
CN116088773A (en) * 2023-04-11 2023-05-09 南京砺算科技有限公司 Data loading method, device, equipment and medium based on implicit GEMM convolution

Similar Documents

Publication Publication Date Title
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
KR102492477B1 (en) Matrix multiplier
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN109213962B (en) Operation accelerator
CN115328439A (en) Incremental matrix multiplication accelerator applied to HPC/AI
US5099447A (en) Blocked matrix multiplication for computers with hierarchical memory
US20200410327A1 (en) Schedule-Aware Tensor Distribution Module
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN109977347B (en) Reconfigurable FFT processor supporting multimode configuration
Bu et al. A design methodology for fixed-size systolic arrays
CN113076521B (en) Reconfigurable architecture method based on GPGPU and computing system
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
Wang et al. Towards memory-efficient allocation of CNNs on processing-in-memory architecture
Rajopadhye et al. Systolic array synthesis by static analysis of program dependencies
WO2023098256A1 (en) Neural network operation method and apparatus, chip, electronic device and storage medium
CN116710912A (en) Matrix multiplier and control method thereof
CN110414672B (en) Convolution operation method, device and system
CN116521611A (en) Generalized architecture design method of deep learning processor
CN113157638B (en) Low-power-consumption in-memory calculation processor and processing operation method
JP2021128752A (en) Method for data placement for in-memory-computing, and memory module with the method applied thereto
US11886347B2 (en) Large-scale data processing computer architecture
US11500629B2 (en) Processing-in-memory (PIM) system including multiplying-and-accumulating (MAC) circuit
CN114398308A (en) Near memory computing system based on data-driven coarse-grained reconfigurable array
CN113392959A (en) Method for reconstructing architecture in computing system and computing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination