WO2019205617A1

WO2019205617A1 - Calculation method and apparatus for matrix multiplication

Info

Publication number: WO2019205617A1
Application number: PCT/CN2018/117559
Authority: WO
Inventors: 方民权; 吴小蓉; 程剑
Original assignee: 华为技术有限公司
Priority date: 2018-04-26
Filing date: 2018-11-27
Publication date: 2019-10-31
Also published as: CN110415157A; CN110415157B

Abstract

A matrix multiplier (400). The fully connected network included in an existing matrix multiplier occupies a large chip space, and a great amount of memory access is required during the calculation of the matrix multiplier, thus causing low efficiency of matrix multiplication calculation performed by a streaming multiprocessor. On the basis of the purpose of improving the efficiency of matrix multiplication calculation performed by a graphics processing unit, during matrix multiplication, according to the characteristic that different groups of repositories can be accessed simultaneously, the matrix multiplier (400) each time loads a row of elements of a matrix as a multiplicand and a column of elements of the matrix as a multiplier to a corresponding calculation unit, and performs calculation. By use of the matrix multiplier (400), the steps required by implementation of the matrix multiplication calculation can be decreased, and the number of times of memory access required is reduced, thereby improving the efficiency of matrix multiplication calculation performed by the graphics processing unit.

Description

Method and device for calculating matrix multiplication

Technical field

The present invention relates to the field of graphics technologies, and in particular, to a technical field of matrix multiplication calculation.

Background technique

A graphics processor (Graphics Processing Unit, GPU) is a microprocessor that performs image computing operations on devices such as a host computer. In the GPU, the streaming multiprocessor (Streaming Multiprocessor, SM) is the basic computing unit; it adopts a single-instruction, multiple-thread execution mode and can ensure simultaneous execution of multiple threads. In general, an SM includes an instruction cache (Instruction Buffer), a warp scheduler (Warp Scheduler), an instruction dispatch unit (Dispatch Unit), streaming processors (Streaming Processor, SP), double-precision floating-point units (DP), and other units.

Matrix multiplication is one of the most important operations in data calculation when the GPU performs image processing, and has many applications. For example, in deep learning, convolutional neural networks give better results in image and speech recognition and perform very well in large-scale image processing. In some implementations of convolutional neural networks, the convolution calculation can be converted into a matrix multiplication: the convolution kernel matrix and the input image matrix are transformed into two large matrices A and B, and A and B are then multiplied to obtain a result matrix D. Each row of the result matrix D represents one output image, so the number of output images is equal to the number of rows of the result matrix D.

A matrix is an important basic concept in mathematics: an M*N matrix is a rectangular array of elements arranged in M rows and N columns. Two matrices can be multiplied only if the number of columns of the first matrix (the multiplicand) is the same as the number of rows of the second matrix (the multiplier). The calculation rule of matrix multiplication is that each element of the first row of the first matrix is multiplied by the corresponding element of the first column of the second matrix, and the products are added together to give the element in the first row and first column of the result matrix. By analogy, the element in the Jth row and Kth column of the result matrix is equal to the sum of the products of each element of the Jth row of the first matrix and the corresponding element of the Kth column of the second matrix. The calculation rule of matrix addition is relatively simple: the elements at the same position of the two matrices to be added are added to give the element at that position of the result matrix, thereby obtaining the result matrix.
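The two rules described above can be illustrated with a minimal Python sketch. The function names `mat_mul` and `mat_add` are hypothetical and chosen only for this illustration; the sketch restates the textbook rules and is not the circuit of the application.

```python
def mat_mul(first, second):
    """Multiply two matrices given as lists of rows, following the rule above."""
    if len(first[0]) != len(second):
        raise ValueError("columns of the first matrix must equal rows of the second")
    inner = len(second)
    # Element (j, k) is the sum of products of row j of `first` and column k of `second`.
    return [[sum(first[j][i] * second[i][k] for i in range(inner))
             for k in range(len(second[0]))]
            for j in range(len(first))]


def mat_add(first, second):
    """Add two matrices of the same size element by element."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(first, second)]
```

For example, under these assumptions `mat_mul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` evaluates the first-row, first-column element as 1*5 + 2*7.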

Correspondingly, the matrix multiplier is an important component of the SM in the GPU. The GPU relies on various algorithms to perform matrix multiplication operations. At present, matrix multiplication performed by the SM in the GPU requires a large amount of chip space and a large number of memory accesses, so the SM performs matrix multiplication calculations inefficiently.

Summary of the invention

Embodiments of the present application provide a matrix multiplier that can improve the efficiency of matrix multiplication calculations.

In a first aspect, the present application provides a matrix multiplier comprising N*N computing units, where the N*N computing units form an N*N matrix and N is a positive integer greater than or equal to 2. The matrix multiplier further includes two repository sets, each of which includes N repositories. The first repository set is used to store a first multiplication matrix in the input matrix, and the second repository set is used to store a second multiplication matrix in the input matrix. The N repositories in the first repository set are connected to the N*N matrix by row connection, with the Mth repository in the first repository set connected to each computing unit of the Mth row in the N*N matrix. The N repositories in the second repository set are connected to the N*N matrix by column connection, with the Mth repository in the second repository set connected to each computing unit of the Mth column in the N*N matrix, where M is a variable with 1 ≤ M ≤ N. In each clock cycle, each computing unit of each row in the N*N matrix is configured to receive first input data broadcast by the repository in the first repository set connected to itself, and each computing unit of each column in the N*N matrix is configured to receive second input data broadcast by the repository in the second repository set connected to itself. In each clock cycle, each computing unit in the N*N matrix performs a multiplication calculation on the received first input data and second input data. After the end of the Nth clock cycle, the matrix multiplier completes the multiplication of the first multiplication matrix and the second multiplication matrix.

In the above scheme, when performing matrix multiplication, the matrix multiplier exploits the fact that different groups of repositories can be accessed simultaneously: each time, a row of elements of the matrix serving as the multiplicand and a column of elements of the matrix serving as the multiplier are loaded into the corresponding computing units, and the calculations are performed at the same time. This reduces the steps required to complete the matrix multiplication and the number of memory accesses required, thereby improving the efficiency of matrix multiplication calculation performed by the graphics processor.

For the first aspect, a possible implementation manner is that, in each clock cycle, all the computing units located in the same row of the N*N matrix receive the same first input data, and all the computing units located in the same column of the N*N matrix receive the same second input data. By doing this, the efficiency of matrix multiplication calculation performed by the graphics processor can be improved.

For the first aspect, another possible implementation manner is that the matrix multiplier further includes a third repository set, where the third repository set is used to store a result matrix, the N repositories in the third repository set are connected to the N*N matrix by column connection, and the Mth repository in the third repository set is connected to each computing unit of the Mth column in the N*N matrix. By doing this, the efficiency of outputting the result matrix of the matrix multiplication operation can be improved.

For the first aspect, another possible implementation manner is that the matrix multiplier further includes a fourth repository set, where the fourth repository set is used to store an addition matrix in the input matrix, the N repositories in the fourth repository set are connected to the N*N matrix by row connection, and the Mth repository in the fourth repository set is connected to each computing unit of the Mth row in the N*N matrix. In the first clock cycle, each computing unit of the first column in the N*N matrix is configured to receive a first group of data input by the repository in the fourth repository set connected to itself, the first group of data being the first column of data in the addition matrix. In the second clock cycle, each computing unit of the second column in the N*N matrix is configured to receive a second group of data input by the repository in the fourth repository set connected to itself, the second group of data being the second column of data in the addition matrix. And so on, until in the Nth clock cycle each computing unit of the Nth column in the N*N matrix is configured to receive the Nth group of data input by the repository in the fourth repository set connected to itself. In the (N+1)th clock cycle, each computing unit in the N*N matrix is further configured to add the received input data of the addition matrix to the multiplication calculation result of the first multiplication matrix and the second multiplication matrix, to obtain a multiply-add calculation result of the first multiplication matrix, the second multiplication matrix, and the addition matrix. In this way, the matrix multiplier can also be used for matrix addition operations.

For the first aspect, another possible implementation manner is that the matrix multiplier further includes a scheduler, configured to obtain a first multiplication matrix and a second multiplication matrix in the form of N*N matrices, and to store the first multiplication matrix and the second multiplication matrix in the first repository set and the second repository set, respectively. By doing so, a matrix that requires matrix multiplication calculation can be divided into multiplication matrices in the form of N*N matrices suitable for the matrix multiplier, thereby improving the efficiency of the matrix multiplier.

In a second aspect, the present application provides a graphics processor comprising the matrix multiplier as described in the first aspect.

In a third aspect, the application provides a system on a chip, the system on a chip comprising the matrix multiplier as described in the first aspect.

In a fourth aspect, the present application provides a calculation method performed by a matrix multiplier, the matrix multiplier comprising: N*N computing units, where the N*N computing units form an N*N matrix and N is a positive integer greater than or equal to 2; and two repository sets, each including N repositories, where the first repository set is used to store a first multiplication matrix in the input matrix and the second repository set is used to store a second multiplication matrix in the input matrix. The N repositories in the first repository set are connected to the N*N matrix by row connection, with the Mth repository in the first repository set connected to each computing unit of the Mth row in the N*N matrix. The N repositories in the second repository set are connected to the N*N matrix by column connection, with the Mth repository in the second repository set connected to each computing unit of the Mth column in the N*N matrix, where M is a variable with 1 ≤ M ≤ N.

The calculation method includes: in the first clock cycle, each computing unit of each row in the N*N matrix receives first input data broadcast by the repository in the first repository set connected to itself, and each computing unit of each column in the N*N matrix receives second input data broadcast by the repository in the second repository set connected to itself; each computing unit in the N*N matrix performs a multiplication calculation on the first input data and the second input data to obtain a first multiplication result, adds the first multiplication result to the initial value in its internal register to obtain a first multiply-add result, and saves the first multiply-add result in the internal register. In the second clock cycle, each computing unit of each row in the N*N matrix receives first input data broadcast by the repository in the first repository set connected to itself, and each computing unit of each column in the N*N matrix receives second input data broadcast by the repository in the second repository set connected to itself; each computing unit in the N*N matrix performs a multiplication calculation on the first input data and the second input data to obtain a second multiplication result, adds the second multiplication result to the first multiply-add result held in its internal register to obtain a second multiply-add result, and saves the second multiply-add result in the internal register. In subsequent clock cycles the calculation proceeds by analogy, until after the Nth clock cycle the matrix multiplier completes the multiplication of the first multiplication matrix and the second multiplication matrix.
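The cycle-by-cycle method of the fourth aspect can be modelled in software with the following minimal sketch. It is a behavioural illustration only, not a description of the hardware: the function name `broadcast_matmul` and the list-of-rows representation are assumptions made for this example, and the choice of which element each repository broadcasts in a given cycle follows the d00 example given for the computing unit in the detailed description (a00 with b00, then a01 with b10, and so on).

```python
def broadcast_matmul(first, second):
    """Model N clock cycles of an N*N grid of computing units (hypothetical helper)."""
    n = len(first)
    # Internal register of every computing unit, with initial value 0.
    registers = [[0] * n for _ in range(n)]
    for cycle in range(n):
        # Repository `row` of the first set broadcasts one value to its whole row.
        first_inputs = [first[row][cycle] for row in range(n)]
        # Repository `col` of the second set broadcasts one value to its whole column.
        second_inputs = [second[cycle][col] for col in range(n)]
        for row in range(n):
            for col in range(n):
                # Multiply the two received values and accumulate in the register.
                registers[row][col] += first_inputs[row] * second_inputs[col]
    return registers
```

In this model all N*N multiply-add steps of one cycle are independent of each other, which is why the hardware described can perform them at the same time.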

For the fourth aspect, a possible implementation manner is that, in each clock cycle, all the computing units located in the same row of the N*N matrix receive the same first input data, and all the computing units located in the same column of the N*N matrix receive the same second input data.

For the fourth aspect, another possible implementation manner is that the matrix multiplier further includes a third repository set, where the third repository set is used to store a result matrix, the N repositories in the third repository set are connected to the computing units in the N*N matrix by column connection, and the Mth repository in the third repository set is connected to each computing unit of the Mth column in the N*N matrix. The calculation method further includes: each computing unit in the N*N matrix outputs the calculated Nth multiply-add result to the repository in the third repository set connected to itself.

For the fourth aspect, another possible implementation manner is that the matrix multiplier further includes a fourth repository set, where the fourth repository set is used to store an addition matrix in the input matrix, the N repositories in the fourth repository set are connected to the N*N matrix by row connection, and the Mth repository in the fourth repository set is connected to each computing unit of the Mth row in the N*N matrix. The calculation method further includes: in the first clock cycle, each computing unit of the first column in the N*N matrix receives a first group of data input by the repository in the fourth repository set connected to itself, the first group of data being the first column of data in the addition matrix; in the second clock cycle, each computing unit of the second column in the N*N matrix receives a second group of data input by the repository in the fourth repository set connected to itself, the second group of data being the second column of data in the addition matrix; and so on, until in the Nth clock cycle each computing unit of the Nth column in the N*N matrix receives the Nth group of data input by the repository in the fourth repository set connected to itself. In the (N+1)th clock cycle, each computing unit in the N*N matrix adds the received input data of the addition matrix to the multiplication calculation result of the first multiplication matrix and the second multiplication matrix, to obtain a multiply-add calculation result of the first multiplication matrix, the second multiplication matrix, and the addition matrix.
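As an illustration of the timing described above, the following sketch models the loading of one column of the addition matrix per clock cycle and the final addition in cycle N+1. The function name `broadcast_matmul_add` and the two-dimensional lists standing in for the registers are assumptions for this example; the sketch models behaviour only.

```python
def broadcast_matmul_add(first, second, addition):
    """Model N multiply cycles plus one add cycle (hypothetical helper)."""
    n = len(first)
    products = [[0] * n for _ in range(n)]  # accumulated multiplication results
    addends = [[0] * n for _ in range(n)]   # addition-matrix element held by each unit
    for cycle in range(n):
        for row in range(n):
            for col in range(n):
                products[row][col] += first[row][cycle] * second[cycle][col]
            # The row-connected repository `row` of the fourth set delivers one
            # element to the unit in column `cycle`, so a whole column of the
            # addition matrix arrives in each cycle.
            addends[row][cycle] = addition[row][cycle]
    # Cycle N+1: every computing unit adds its addend to its accumulated product.
    return [[products[r][c] + addends[r][c] for c in range(n)] for r in range(n)]
```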

Description of the drawings

FIG. 1 is a schematic diagram of the structure of a repository set in the prior art.

FIG. 2 is a schematic diagram showing the structure of a matrix multiplier in the prior art.

FIG. 3 is a schematic diagram of the structure of a computing unit in the prior art.

FIG. 4 is a schematic diagram showing the structure of a matrix multiplier provided by an embodiment of the present application.

FIG. 5 is a schematic diagram of a structure of a computing unit provided by an embodiment of the present application.

FIG. 6 is a schematic flow chart of an embodiment of the present application.

FIG. 7 is a schematic diagram of an initial state of an embodiment of the present application.

FIG. 8 is a schematic diagram showing the state of the first clock cycle of the embodiment of the present application.

FIG. 9 is a schematic diagram of a state of a second clock cycle of an embodiment of the present application.

FIG. 10 is a schematic diagram showing the state of the third clock cycle of the embodiment of the present application.

FIG. 11 is a schematic diagram of a fourth clock cycle state of an embodiment of the present application.

FIG. 12 is a schematic diagram of a fifth clock cycle state of an embodiment of the present application.

FIG. 13 is a schematic structural diagram of a graphics processor provided by an embodiment of the present application.

FIG. 14 is a schematic structural diagram of a system on chip provided by an embodiment of the present application.

FIG. 15 is a schematic structural diagram of a matrix multiplier block provided by an embodiment of the present application.

Detailed description

In GPUs, the storage of data is usually organized in the form of repositories (banks). FIG. 1 is a schematic diagram of the structure of a repository set. As shown in FIG. 1, a repository set is composed of a plurality of column storage blocks, and each column storage block is a repository, where each storage block is 32 bits or 64 bits in size. A repository set is row-contiguous by default, that is, when values are assigned to the repository set, consecutive elements are stored contiguously along rows. When an instruction is executed in the SM, the load/store unit (Load/Store Unit, LD/ST) loads the data from the video memory into the repository, and the SP needs to access the repository and read data from it when executing a specific calculation instruction. Therefore, there are a large number of SPs and repositories in the SM (usually, the number of SPs in an SM is the same as the number of repositories in the SM), and each SP may need to access data in any group of repositories. As shown in FIG. 2, in the prior art, the repositories and the SPs are connected to each other through a fully connected network to form a matrix multiplier. The SPs, the DPs, and the repositories in FIG. 2 are all connected to the fully connected network. In this way, mutual access between the SPs and all the repositories is realized.

The SP mainly includes a computing unit for performing the basic steps of matrix multiplication. FIG. 3 is a schematic diagram of the structure of a typical computing unit. As shown in FIG. 3, the computing unit mainly includes four registers (register 301, register 302, register 303, and register 304), a multiplication unit 305, and an addition unit 306. The multiplicand and the multiplier are placed in register 301 and register 302, respectively; the multiplication unit 305 multiplies the two numbers, and the addition unit 306 adds the product to the number placed in register 303 (if no addition is required, 0 can be placed in register 303). The result of the addition is stored in register 304, thereby completing one multiply-add calculation.

When the matrix multiplication operation A*B is performed, the element in the Jth row and Kth column of the result matrix D is equal to the sum of the products of the elements at corresponding positions in the Jth row of the first matrix and the Kth column of the second matrix. For example, if matrix A and matrix B are both matrices of 4 rows and 4 columns, the element in the first row and first column of matrix D is obtained by multiplying each element of the first row of matrix A by the corresponding element of the first column of matrix B and adding the 4 products together.

Based on the structure of the above SM, the current process of implementing matrix multiplication operation A*B is as follows:

First, the matrix is segmented according to the specifications of the multiplier in the SM to form a sub-matrix conforming to the multiplier specification, and the access unit further loads the segmented sub-matrix from the video memory into the repository. In particular, when the specification of the matrix to be divided is smaller than the specification of the multiplier, it is necessary to fill the corresponding position of the matrix to be divided with 0 to form a sub-matrix conforming to the specification.
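The segmentation and zero-filling step can be sketched as follows. The helper names `pad_to_multiple` and `split_into_tiles`, and the parameter `n` standing for the multiplier specification, are hypothetical and used only for this illustration; the source does not specify how the segmentation is implemented.

```python
def pad_to_multiple(matrix, n):
    """Fill a matrix with zeros until both dimensions are multiples of n."""
    rows, cols = len(matrix), len(matrix[0])
    padded_rows = -(-rows // n) * n  # ceiling division, then scale back up
    padded_cols = -(-cols // n) * n
    return [[matrix[r][c] if r < rows and c < cols else 0
             for c in range(padded_cols)]
            for r in range(padded_rows)]


def split_into_tiles(matrix, n):
    """Return a grid of n*n sub-matrices covering the zero-padded matrix."""
    padded = pad_to_multiple(matrix, n)
    return [[[row[c:c + n] for row in padded[r:r + n]]
             for c in range(0, len(padded[0]), n)]
            for r in range(0, len(padded), n)]
```

Zero-filling does not change the product, because every padded element contributes a zero term to each sum of products.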

Secondly, one data element at a time is read into the corresponding SP from the repository that holds the sub-matrix to be multiplied. Since all SPs and repositories are connected to each other through a fully connected network, the SP can read the corresponding elements of matrix A and matrix B into the computing unit according to the calculation rule of matrix multiplication. For example, suppose matrix A and matrix B are sub-matrices of 4 rows and 4 columns (which can be expressed as 4*4). Calculating the value of d00 in the result matrix D requires the calculation of a00*b00+a01*b10+a02*b20+a03*b30. Therefore, a00 and b00 are taken out from the repositories corresponding to matrix A and matrix B, respectively, and placed in register 301 and register 302 of the computing unit in the corresponding SP.

Finally, the SP uses the computing unit to perform the multiply-add operation. It should be noted that each time a multiply-add operation is performed, the result is stored in a storage space prepared in advance. When the next multiplication is performed, the previous multiply-add result is taken out of the storage space and placed in register 303 for the addition. For example, after the calculation of a00*b00 is completed, the result is moved from register 304 to the storage space prepared in advance. When a01*b10 is calculated, the result of a00*b00 is loaded from the storage space into register 303, the addition unit 306 adds the product a01*b10 to the result of a00*b00, and the sum is first placed in register 304 and then saved in the corresponding storage space. By analogy, the SP finally completes the calculation of a00*b00+a01*b10+a02*b20+a03*b30 and writes the result through the fully connected network to the corresponding position in the repository as the value of d00.

When the product A*B of matrices of size N*N is calculated by using the above algorithm and device, only one element is taken out from each matrix for calculation at a time, and each calculation result is first stored in a predetermined storage space and then read back for the next calculation. Therefore, to complete the multiplication of matrix A and matrix B, a total of N*N*N multiply-add operations, 3*N*N*N read operations, and N*N*N write operations are required. Moreover, all SPs and repositories are connected through a fully connected network, which is less efficient and occupies more chip space.
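The operation counts stated above can be checked with a small counting model of the prior-art flow. The function name `prior_art_matmul` and the variables named after registers 301 to 304 are illustrative assumptions, and the model simplifies the first step by also reading the (zero) partial sum from storage, so that every multiply-add costs three reads and one write.

```python
def prior_art_matmul(first, second):
    """Return (result, reads, writes) for the one-element-at-a-time flow."""
    n = len(first)
    storage = [[0] * n for _ in range(n)]  # storage space for partial sums
    reads = writes = 0
    for j in range(n):
        for k in range(n):
            for i in range(n):
                reg_301 = first[j][i]      # read one element of matrix A
                reg_302 = second[i][k]     # read one element of matrix B
                reg_303 = storage[j][k]    # read back the previous partial sum
                reads += 3
                reg_304 = reg_301 * reg_302 + reg_303  # one multiply-add
                storage[j][k] = reg_304    # write the partial sum back
                writes += 1
    return storage, reads, writes
```

For N = 4 this model performs 64 multiply-add operations, 192 reads, and 64 writes, matching N*N*N, 3*N*N*N, and N*N*N.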

For the purpose of improving the efficiency of matrix multiplication calculation performed by the SM in the GPU, embodiments of the present application provide a new matrix multiplier for use in a GPU. In the embodiments of the present application, when the multiplication of matrix A and matrix B of size N*N is calculated, elements are no longer taken out of each matrix one at a time. Instead, the fact that different groups of repositories can be accessed simultaneously is exploited: each time, a row of elements of matrix A and a column of elements of matrix B are loaded into the corresponding computing units, and the calculations are performed at the same time. By doing so, the steps needed to complete the multiplication of matrix A and matrix B can be reduced, and the number of memory accesses required can be reduced, thereby improving the efficiency of matrix multiplication calculation performed by the SM.

FIG. 4 is a schematic diagram showing the structure of a matrix multiplier provided by an embodiment of the present application. As shown in FIG. 4, the matrix multiplier 400 includes a scheduler, repositories, and computing units. The scheduler is configured to obtain matrices of the corresponding specification for calculation and to save the matrices in the corresponding repository sets. The scheduler specifically includes a matrix multiplication scheduling unit 401, an instruction distribution unit 402, and an instruction distribution unit 403 (two instruction distribution units are shown in the figure; there may actually be one or more). The matrix multiplication scheduling unit 401 serves as the instruction scheduling unit of the matrix multiplier 400 and is mainly responsible for instruction ordering and scheduling; through the input instructions, processes such as input, loading, calculation, storage, and output can be organically combined. The instruction distribution units are connected to the repositories and the computing units through control connections (not shown), and are configured to send the scheduling instructions determined by the matrix multiplication scheduling unit 401 to the repositories and the computing units, so that the repositories and the computing units process the data according to the instructions. In the embodiment of the present application, the matrix multiplier 400 may include two instruction distribution units, so that dual issue of instructions can be realized.

In the present application, the computing units of the matrix multiplier are no longer connected to the repositories through a fully connected network. As shown in FIG. 4, each matrix multiplier includes N*N computing units, and the N*N computing units form an N*N matrix (illustrated as a 4*4 matrix, in which the computing units are named, from left to right and top to bottom, computing unit 430 to computing unit 445). Each matrix multiplier also includes at least two repository sets, and each repository set includes N repositories (the figure shows four repository sets with a total of 16 repositories, named repository 410 to repository 425). The first repository set is used to store a first multiplication matrix in the input matrix, and the second repository set is used to store a second multiplication matrix in the input matrix. Optionally, the matrix multiplier may further include a third repository set and a fourth repository set, where the third repository set is used to store the result matrix and the fourth repository set is used to store the addition matrix in the input matrix. The matrix multiplier of the present application can perform N*N multiplication calculations in one calculation cycle (in the computer field, also called a clock cycle or beat), thereby improving computational efficiency.

To this end, the first repository set is connected to the computing units by row connection: the N repositories in the first repository set are directly connected to the computing units of the respective rows of the N*N computing unit matrix. The first repository of the first repository set is connected to each computing unit of the first row of the N*N computing unit matrix, the second repository of the first repository set is connected to each computing unit of the second row, and the Nth repository of the first repository set is connected to each computing unit of the Nth row. The second repository set is connected to the computing units by column connection: the N repositories in the second repository set are directly connected to the computing units of the respective columns of the N*N computing unit matrix. The first repository of the second repository set is connected to each computing unit of the first column of the N*N computing unit matrix, the second repository of the second repository set is connected to each computing unit of the second column, and the Nth repository of the second repository set is connected to each computing unit of the Nth column.

For example, as shown in FIG. 4, the first repository set includes repository 410, repository 411, repository 412, and repository 413. The first repository set maintains row connections with the computing unit matrix: repository 410 is connected to each computing unit of the first row in the matrix, repository 411 to each computing unit of the second row, repository 412 to each computing unit of the third row, and repository 413 to each computing unit of the fourth row. The second repository set includes repository 414, repository 415, repository 416, and repository 417. The second repository set maintains column connections with the computing unit matrix: repository 414 is connected to each computing unit of the first column in the matrix, repository 415 to each computing unit of the second column, repository 416 to each computing unit of the third column, and repository 417 to each computing unit of the fourth column. With this connection manner, the first repository set can broadcast N data elements to the N*N computing units in the first clock cycle, and the second repository set can also broadcast N data elements to the N*N computing units in the first clock cycle. Each computing unit can perform one multiplication calculation in the first clock cycle, and after N clock cycles all multiplication calculations can be completed.
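The row and column connections described for FIG. 4 can be tabulated with a short sketch. The function name `fig4_connections` and its default arguments are assumptions for this illustration; they simply encode the reference numerals given in the text (computing units 430 to 445 numbered left to right and top to bottom, repositories 410 to 413 connected by rows, and repositories 414 to 417 connected by columns).

```python
def fig4_connections(n=4, first_set_base=410, second_set_base=414, unit_base=430):
    """Map each computing unit number to its (row repository, column repository)."""
    table = {}
    for row in range(n):
        for col in range(n):
            unit = unit_base + row * n + col      # left to right, top to bottom
            row_repository = first_set_base + row  # first set: one repository per row
            col_repository = second_set_base + col # second set: one repository per column
            table[unit] = (row_repository, col_repository)
    return table
```

Each computing unit is thus wired to exactly one repository of each set, instead of to every repository as in a fully connected network.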

Further, the fourth repository set in the matrix multiplier is used to load the addition matrix C in the input matrix. In the present application, the fourth repository set may maintain a row connection with the computing unit matrix, or may maintain a column connection with the computing unit matrix. If the fourth repository set maintains a row connection with the computing unit matrix, each repository in the fourth repository set is connected to each computing unit of one row of the computing unit matrix. For example, in FIG. 4, repository 418 is connected to each computing unit of the first row in the computing unit matrix, repository 419 to each computing unit of the second row, repository 420 to each computing unit of the third row, and repository 421 to each computing unit of the fourth row. The fourth repository set can load data to N computing units in every clock cycle. After N clock cycles, the addition matrix C stored in the fourth repository set has been entirely input to the corresponding computing units, and the corresponding addition calculation can then be performed in the (N+1)th clock cycle.

Further, the third repository set in the matrix multiplier is used to store the result matrix D. In the present application, the third repository set may maintain a row connection with the computing unit matrix, or may maintain a column connection with the computing unit matrix. If the third repository set maintains a column connection with the computing unit matrix, each repository in the third repository set is connected to each computing unit of one column of the computing unit matrix. For example, in FIG. 4, repository 425 is connected to each computing unit of the first column in the computing unit matrix, repository 424 to each computing unit of the second column, repository 423 to each computing unit of the third column, and repository 422 to each computing unit of the fourth column.

FIG. 5 shows a computing unit provided by an embodiment of the present application, adapted to the matrix multiplication operations supported by the matrix multiplier 400. As shown in FIG. 5, the computing unit 500 may be any one of the computing units in the matrix multiplier 400 described above, and includes five registers (register 501 to register 505), a multiplication unit 506, an addition unit 507, and an addition unit 508. When the matrix A and the matrix B are multiplied, the element in the first row and first column of the matrix D is d00=a00*b00+a01*b10+a02*b20+a03*b30. In the first clock cycle, a00 and b00 are placed in the register 501 and the register 502 respectively, the multiplication unit 506 calculates a00*b00, and the addition unit 507 adds the product a00*b00 to the value stored in the register 503 and writes the sum back to the register 503, replacing the previous value. In the initial state the value stored in the register 503 is 0, so after this calculation the register 503 holds a00*b00. In the second clock cycle, a01 and b10 are placed in the register 501 and the register 502 respectively, and the multiplication unit 506 calculates a01*b10. In the third clock cycle, a02 and b20 are placed in the register 501 and the register 502 respectively, and the multiplication unit 506 calculates a02*b20. In the fourth clock cycle, a03 and b30 are placed in the register 501 and the register 502 respectively, and the multiplication unit 506 calculates a03*b30. In each of these cycles the addition unit 507 accumulates the new product into the register 503 in the same way, so after the fourth clock cycle the value in the register 503 is a00*b00+a01*b10+a02*b20+a03*b30. When the value stored in the register 504 is 0, the value in the register 505 equals the value in the register 503; this value is taken out and stored in the storage space prepared in advance. When an element of the addition matrix C is stored in the register 504, the value in the register 503 is added to the value in the register 504, and the sum is used as the value of the element d00 in the first row and first column of the result matrix. Thus, when the matrix D=A*B+C is calculated, the element d00=a00*b00+a01*b10+a02*b20+a03*b30+c00 of the result matrix D can be calculated by the computing unit 500 by placing c00 in the register 504 and using the addition unit 508 to add it to the result of a00*b00+a01*b10+a02*b20+a03*b30.
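The data path of the computing unit 500 can be illustrated with a small software model. This is only a sketch under the assumption that one call to `multiply_accumulate` corresponds to one clock cycle; the class name, method names, and sample values are hypothetical and introduced for illustration.

```python
class ComputingUnit:
    """Software model of computing unit 500 (registers 501-505, units 506-508)."""

    def __init__(self):
        self.r501 = 0  # operand from matrix A
        self.r502 = 0  # operand from matrix B
        self.r503 = 0  # running sum of products, initially 0
        self.r504 = 0  # element of addition matrix C (0 if there is none)
        self.r505 = 0  # final result element of matrix D

    def multiply_accumulate(self, a, b):
        """One clock cycle: unit 506 multiplies, unit 507 accumulates into 503."""
        self.r501, self.r502 = a, b
        product = self.r501 * self.r502   # multiplication unit 506
        self.r503 = self.r503 + product   # addition unit 507

    def load_c(self, c):
        """Place an element of the addition matrix C in register 504."""
        self.r504 = c

    def finish(self):
        """Addition unit 508: add registers 503 and 504, store the sum in 505."""
        self.r505 = self.r503 + self.r504
        return self.r505


# d00 = a00*b00 + a01*b10 + a02*b20 + a03*b30 + c00 with made-up sample values
row_of_a = [1, 2, 3, 4]   # a00, a01, a02, a03
col_of_b = [5, 6, 7, 8]   # b00, b10, b20, b30
unit = ComputingUnit()
for a, b in zip(row_of_a, col_of_b):
    unit.multiply_accumulate(a, b)
unit.load_c(10)           # c00
d00 = unit.finish()
```

If no addition matrix is supplied, register 504 stays at 0 and `finish` simply copies the accumulated product sum into register 505, matching the pure multiplication case described above.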

FIG. 6 is a schematic flow chart of an embodiment of the present application.

S601: Before the matrices enter the matrix multiplier for calculation, the matrices to be operated on are partitioned into submatrices of the N*N size supported by the matrix multiplier, and the submatrices are stored in the corresponding repositories. If a block obtained by the partitioning is smaller than N*N, zeros are filled into the missing positions to form an N*N submatrix; this does not affect the calculation result. Continuing with N=4 as an example, A, B, C, and D are all matrices of 4 rows and 4 columns, and their elements are denoted aij, bij, cij, and dij, where i is the row number of the element in the matrix minus 1, j is the column number of the element in the matrix minus 1, and i and j are integers greater than or equal to 0 and less than or equal to 3. Referring to the initial state shown in FIG. 7, the elements of the matrix A are placed in the repository 410 to the repository 413 by row, the elements of the matrix B are placed in the repository 414 to the repository 417 by column, and the elements of the matrix C are placed in the repository 418 to the repository 421 by row.
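The partitioning and zero-filling of step S601 can be sketched as follows. This is an illustrative software version only; the function name `partition` and the dictionary layout of the result are assumptions, and the text does not specify how the partitioning is implemented.

```python
def partition(matrix, n):
    """Split a matrix (list of rows) into n*n tiles, zero-padding the edges.

    Returns a dict mapping (tile_row, tile_col) to an n*n list of lists.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    tile_rows = -(-rows // n)   # ceiling division
    tile_cols = -(-cols // n)
    tiles = {}
    for tr in range(tile_rows):
        for tc in range(tile_cols):
            tile = [[0] * n for _ in range(n)]   # zero padding by default
            for i in range(n):
                for j in range(n):
                    r, c = tr * n + i, tc * n + j
                    if r < rows and c < cols:
                        tile[i][j] = matrix[r][c]
            tiles[(tr, tc)] = tile
    return tiles

# A 2*3 matrix becomes a single 4*4 tile padded with zeros.
tiles = partition([[1, 2, 3], [4, 5, 6]], 4)
```

The padded zeros contribute only zero products and zero sums, which is why the padding leaves the result unchanged.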

S602: In each time period, each of the repository 410 to the repository 417 broadcasts, according to the received instruction and in a predetermined order, one data element to all the computing units connected to it. Each computing unit receives an element from the matrix A and an element from the matrix B, places them in its register 501 and register 502 respectively, calculates their product by the method described above, and adds this product to the previous multiply-add result to obtain the current multiply-add result. In the first time period, the previous result is the initial value of the register, which is 0.

Specifically, in the Mth time period, the repository 410 to the repository 413 send the elements of the Mth column of the matrix A to the set of computing units by broadcasting, where the computing units in the Jth row receive the element in the Jth row and the Mth column of the matrix A; the repository 414 to the repository 417 send the elements of the Mth row of the matrix B to the set of computing units by broadcasting, where the computing units in the Kth column receive the element in the Mth row and the Kth column of the matrix B. J, K, and M are positive integers less than or equal to 4. Each computing unit multiplies the received elements from the matrix A and the matrix B to obtain the Mth multiplication result, and adds the Mth multiplication result to the (M-1)th multiply-add result to obtain the Mth multiply-add result, where the 0th result, that is, the initial value of the internal register, is set to 0.

For example, referring to FIG. 8, in the first time period the repository 410 places a00 into the computing unit 430 to the computing unit 433 in the first row, and the repository 414 places b00 into the computing unit 430, the computing unit 434, the computing unit 438, and the computing unit 442 in the first column. In the computing unit 430, a00 and b00 are placed in the register 501 and the register 502 respectively and multiplied, and the result a00*b00 is put into the register 503. The other repositories among the repository 410 to the repository 417 perform the corresponding operations as well. In the second time period, as shown in FIG. 9, the repository 410 and the repository 414 place a01 and b10 respectively into the corresponding computing units; the computing unit 430 calculates the product of a01 and b10, adds it to the a00*b00 stored in the register 503, and writes the sum back to the register 503, so that the value in the register 503 is now a00*b00+a01*b10. The subsequent time periods proceed by analogy; see the state of the third clock cycle shown in FIG. 10 and the state of the fourth clock cycle shown in FIG. 11.
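The broadcast schedule of step S602 amounts to accumulating one column-times-row product per time period. The sketch below simulates this in software; the function name `multiply_by_broadcast` and the sample matrices are made up for illustration, and indices are 0-based whereas the text counts from 1.

```python
def multiply_by_broadcast(a, b):
    """Simulate S602 for n*n matrices a and b (lists of rows).

    In time period m, the unit at row j, column k receives a[j][m] and b[m][k]
    and adds their product to its accumulator (register 503).
    Returns the accumulator contents after each time period.
    """
    n = len(a)
    acc = [[0] * n for _ in range(n)]            # register 503 of every unit
    history = []
    for m in range(n):                           # one iteration per time period
        col_of_a = [a[j][m] for j in range(n)]   # broadcast along rows
        row_of_b = [b[m][k] for k in range(n)]   # broadcast along columns
        for j in range(n):
            for k in range(n):
                acc[j][k] += col_of_a[j] * row_of_b[k]
        history.append([row[:] for row in acc])
    return history

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
b = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
history = multiply_by_broadcast(a, b)
product = history[-1]   # contents of the registers 503 after n time periods
```

In hardware the two inner loops run in parallel across the N*N computing units, so the whole product takes N time periods rather than N*N*N sequential steps.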

In each time period, each of the repository 418 to the repository 421 places, according to the received instruction and in a predetermined order, one of its stored elements of the matrix C into the register 504 of the corresponding computing unit.

Specifically, in the Mth time period, the repository 418 to the repository 421 send the elements of the Mth column of the matrix C to the computing units of the Mth column, where the computing unit in the Lth row and the Mth column receives the element in the Lth row and the Mth column of the matrix C, and L is a positive integer less than or equal to 4.

For example, as shown in FIG. 8, in the first time period the repository 418, the repository 419, the repository 420, and the repository 421 place c00, c10, c20, and c30 into the registers 504 of the computing unit 430, the computing unit 434, the computing unit 438, and the computing unit 442, respectively. Similarly, referring to FIG. 9, in the second time period the repository 418, the repository 419, the repository 420, and the repository 421 place c01, c11, c21, and c31 into the registers 504 of the computing unit 431, the computing unit 435, the computing unit 439, and the computing unit 443, respectively, and so on.
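The column-by-column loading of the matrix C can be written out as a schedule. The sketch below is illustrative only: it assumes the row-major unit numbering 430 to 445 inferred from the examples, and the function name `c_load_schedule` is hypothetical.

```python
N = 4

def c_load_schedule(c):
    """Yield (time_period, unit, value) for loading matrix C into registers 504.

    Time periods are 1-based as in the text. Unit numbers assume the
    row-major numbering 430..445 inferred from the examples.
    """
    for m in range(N):          # time period m+1 serves column m
        for l in range(N):      # repository 418+l serves row l
            unit = 430 + N * l + m
            yield m + 1, unit, c[l][m]

# Use element names instead of numbers to make the schedule readable.
c = [[f"c{l}{m}" for m in range(N)] for l in range(N)]
schedule = list(c_load_schedule(c))
```

Each repository of the fourth set supplies exactly one element per time period, so the whole matrix C arrives in N time periods, in parallel with the multiply-accumulate steps.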

S603: Step S602 is repeated. After four time periods, the repository 410 to the repository 417 have placed the elements of the matrix A and the matrix B that they store into the corresponding computing units, and the corresponding multiply-accumulate calculations are complete. For example, after four time periods the computing unit 430 has completed a00*b00+a01*b10+a02*b20+a03*b30 and placed the result in its own register 503. By the same time, the repository 418 to the repository 421 have placed the elements of the matrix C into the registers 504 of the corresponding computing units. Referring to FIG. 12, in the fifth time period the addition unit 508 in each computing unit adds the values in the register 503 and the register 504, and the sum, as an element of the result matrix D, is put into the register 505 of that computing unit.

S604: The elements of the result matrix D stored in the register 505 of each computing unit are written in sequence to the corresponding repository among the repository 422 to the repository 425. Because each repository can accept only one data element at a time, four time periods are needed to write all the results to the destination repositories.

S605: The elements of the result matrix D stored in the repository 422 to the repository 425 are moved to the specified storage space.

The algorithm proposed in the embodiment of the present application reduces the number of steps needed to multiply the matrix A by the matrix B, thereby increasing the efficiency with which the GPU performs matrix multiplication. Specifically, because the elements of the matrix A and the matrix B are broadcast to a whole row or a whole column of computing units, calculating A*B+C requires only 3*N*N read operations and N*N write operations, which greatly reduces the number of read and write operations compared with the prior art. In addition, because the repositories are connected directly to the computing units, the occupied chip space can be reduced.
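The figures of 3*N*N reads and N*N writes can be reproduced by counting accesses per time period. The sketch below assumes that one broadcast from a repository counts as a single read regardless of how many computing units receive it; the function name `access_counts` is hypothetical.

```python
def access_counts(n):
    """Count repository reads and writes for D = A*B + C on an n*n multiplier.

    Each of the n time periods reads one element from every repository that
    holds A, B, and C (n repositories each); every element of D is written once.
    """
    reads = 0
    for _ in range(n):   # n time periods
        reads += n       # one element per repository of matrix A
        reads += n       # one element per repository of matrix B
        reads += n       # one element per repository of matrix C
    writes = n * n       # one write per element of D
    return reads, writes

reads, writes = access_counts(4)   # 48 reads and 16 writes for N=4
```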

It should be noted that the above reference numerals S601 to S605 are only used for reference, and it is not meant that the above steps need to be performed in a specific order in the embodiment of the present application.

In order to improve the working efficiency of the matrix multiplier provided by the present application, the present application designs two sets of instructions for external calling and internal control of the matrix multiplier, respectively.

For the external call instruction set of the matrix multiplier, the present application designs three instructions.

The first is used to load a matrix from outside the matrix multiplier into the memory of the matrix multiplier, for example mA=load_matrix_mmp(pA,m,n), where pA is a pointer to the matrix outside the matrix multiplier, mA is a pointer to the matrix A inside the matrix multiplier, m is the number of rows of the matrix A (or the number of elements in a column), and n is the number of columns of the matrix A (or the number of elements in a row). The effect of the instruction is to load the matrix A from outside the matrix multiplier into the matrix multiplier.

The second is used to perform the multiply-add calculation of the matrices, for example mD=matrix_mul_mmp(mA, mB, mC, m, n, k), where mA, mB, mC, and mD are pointers to the matrices A, B, C, and D inside the matrix multiplier, m is the number of column elements of the matrices A, C, and D, n is the number of row elements of the matrix A and also the number of column elements of the matrix B, and k is the number of row elements of the matrices B, C, and D. The effect of this instruction is to start the matrix multiplier to perform the matrix operation D=A*B+C.

The third is used to copy the result matrix into memory outside the matrix multiplier, for example store_matrix_mmp(pD,mD,m,n), where pD is a pointer to the matrix outside the matrix multiplier, mD is a pointer to the matrix inside the matrix multiplier, m is the number of column elements of the matrix, and n is the number of row elements of the matrix. The effect of this instruction is to copy the matrix D, of size m*n, to the space outside the matrix multiplier pointed to by pD. For example, with the above external call instruction set, a matrix D=A*B+C of 4*4 size can be calculated by the matrix multiplier as follows:

mA=load_matrix_mmp(pA,4,4);

mB=load_matrix_mmp(pB,4,4);

mC=load_matrix_mmp(pC,4,4);

mD=matrix_mul_mmp(mA,mB,mC,4,4,4);

store_matrix_mmp(pD,mD,4,4);
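The intended effect of the three external instructions can be modeled in ordinary software. The following sketch is not the hardware instruction set: it reuses the instruction names as hypothetical Python functions, represents pointers by nested lists, and uses made-up sample data, purely to show what the call sequence above computes.

```python
def load_matrix_mmp(p, m, n):
    """Model of load_matrix_mmp: copy an m*n matrix into 'internal' storage."""
    return [[p[i][j] for j in range(n)] for i in range(m)]

def matrix_mul_mmp(mA, mB, mC, m, n, k):
    """Model of matrix_mul_mmp: D = A*B + C with A m*n, B n*k, C and D m*k."""
    return [[sum(mA[i][t] * mB[t][j] for t in range(n)) + mC[i][j]
             for j in range(k)] for i in range(m)]

def store_matrix_mmp(pD, mD, m, n):
    """Model of store_matrix_mmp: copy the m*n matrix mD into the buffer pD."""
    for i in range(m):
        for j in range(n):
            pD[i][j] = mD[i][j]

# The 4*4 call sequence from the text, with made-up sample data.
pA = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]   # identity
pB = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pC = [[1] * 4 for _ in range(4)]
pD = [[0] * 4 for _ in range(4)]

mA = load_matrix_mmp(pA, 4, 4)
mB = load_matrix_mmp(pB, 4, 4)
mC = load_matrix_mmp(pC, 4, 4)
mD = matrix_mul_mmp(mA, mB, mC, 4, 4, 4)
store_matrix_mmp(pD, mD, 4, 4)
```

With A chosen as the identity matrix, the stored result is simply B with 1 added to every element, which makes the sequence easy to check by hand.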

For the internal call instruction set of the matrix multiplier, the present application designs two instructions.

The first is to load the elements of the matrix into a specific register of the calculation unit and multiply and accumulate the loaded elements. For example, Load_line_mmp(mA, mB, mC, n), where mA, mB, mC are pointers to matrix A, matrix B, matrix C, respectively, and n represents the number of the loaded row or column. The effect of the instruction is to load the nth column of the matrix A and the nth row of the matrix B into a specific register of the computing unit in the form of a broadcast, and load the nth column of the matrix C into a specific register of the computing unit, and The loaded matrix elements are multiplied and accumulated according to preset rules.

The second is used to perform matrix addition calculations and store the calculation results line by line to the specified storage space. For example, matrix_add_mmp(mD), mD is a pointer to matrix D. In combination with the above instruction, the effect of the instruction is that the multiplication and accumulation result of the matrix A and the matrix B are added to the matrix C, and the calculation result is stored as a result matrix D row by row to the storage space pointed to by the mD.
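One plausible way the two internal instructions combine is sketched below. The text does not state how many times each instruction is issued; the sketch assumes one Load_line_mmp per row or column index followed by a single matrix_add_mmp. The class `MultiplierModel`, the helper `matrix_mul_model`, and the lowercase method names are hypothetical.

```python
class MultiplierModel:
    """Software model of the two internal instructions for an n*n multiplier."""

    def __init__(self, n):
        self.n = n
        self.acc = [[0] * n for _ in range(n)]      # register 503 of each unit
        self.c_regs = [[0] * n for _ in range(n)]   # register 504 of each unit

    def load_line_mmp(self, mA, mB, mC, line):
        """Broadcast column `line` of A and row `line` of B, multiply-accumulate,
        and load column `line` of C into the units of that column."""
        for j in range(self.n):
            for k in range(self.n):
                self.acc[j][k] += mA[j][line] * mB[line][k]
            self.c_regs[j][line] = mC[j][line]

    def matrix_add_mmp(self, mD):
        """Add the accumulated products to C and store the result row by row."""
        for j in range(self.n):
            mD[j] = [self.acc[j][k] + self.c_regs[j][k] for k in range(self.n)]


def matrix_mul_model(mA, mB, mC):
    n = len(mA)
    model = MultiplierModel(n)
    for line in range(n):        # assumed: one Load_line_mmp per line index
        model.load_line_mmp(mA, mB, mC, line)
    mD = [None] * n
    model.matrix_add_mmp(mD)     # assumed: issued once, after all lines
    return mD

d = matrix_mul_model([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]])
```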

The matrix multiplier provided by the embodiment of the present application can be embedded in the GPU to implement matrix multiplication efficiently. Referring to FIG. 13, the GPU includes components such as a memory controller (MMC), a Peripheral Component Interconnect Express (PCI-E) interface, a thread engine, an L2 cache, and several SMs (the L2 cache connects the SMs and the memory controller; this connection is not shown). The SM is the core computing component of the GPU and provides the computing power of the entire GPU. It should be noted that the number of SMs included in the GPU is not fixed and can be adjusted as needed. The number of SMs shown in FIG. 13 is only an example and should not be construed as limiting the application. The matrix multiplier provided in the present application is located in the SM; it reduces the occupied chip space and the number of data reads and writes during matrix multiplication, thereby improving the matrix multiplication performance and the energy efficiency ratio of the GPU.

The matrix multiplier provided by the embodiment of the present application can also be combined with a CPU core to build a system on a chip (SoC) that quickly processes the matrix multiply-add operations in an application. FIG. 14 shows a system on a chip that includes a matrix multiplier. As shown in FIG. 14, the system on chip includes a processor, a digital signal processing (DSP) unit, a codec (CODEC), and a matrix multiplication block (MMB); these components are connected through an L2 cache. The processor may be an Advanced RISC Machine (ARM) processor. Referring to FIG. 15, the MMB is composed of a number of matrix multipliers that are connected through a level-one cache (L1 cache). With this system on chip, it is only necessary to extract the matrix multiply-add operations from the application and perform them with the matrix multiplier provided by the embodiment of the present application, thereby improving the efficiency with which the application runs.

Claims

A matrix multiplier, the matrix multiplier comprising:

N*N computing units, the N*N computing units form a matrix of N*N, and N is a positive integer greater than or equal to 2;

Two repository sets, each repository set comprising N repositories, wherein a first repository set is used to store a first multiplication matrix of the input matrices and a second repository set is used to store a second multiplication matrix of the input matrices; the Mth repository in the first repository set is connected to each computing unit of the Mth row in the N*N matrix, and the Mth repository in the second repository set is connected to each computing unit of the Mth column in the N*N matrix, wherein M is a variable and 1≤M≤N;

In each clock cycle, each computing unit of each row in the N*N matrix is configured to receive first input data broadcast by the repository in the first repository set that is connected to it, and each computing unit of each column in the N*N matrix is configured to receive second input data broadcast by the repository in the second repository set that is connected to it; in each clock cycle, each computing unit in the N*N matrix performs a multiplication calculation on the received first input data and second input data; after the Nth clock cycle ends, the matrix multiplier completes the multiplication operation of the first multiplication matrix and the second multiplication matrix.
The matrix multiplier according to claim 1, wherein, in each clock cycle, all of the computing units in the same row of the N*N matrix receive the same first input data, and all of the computing units in the same column of the N*N matrix receive the same second input data.
The matrix multiplier according to claim 1 or 2, wherein the matrix multiplier further comprises:

a third repository set, wherein the third repository set is used to store a result matrix, and the Mth repository in the third repository set is connected to each computing unit of the Mth column in the N*N matrix.
The matrix multiplier according to any one of claims 1 to 3, wherein the matrix multiplier further comprises: a fourth repository set, wherein the fourth repository set is used to store an addition matrix of the input matrices, and the Mth repository in the fourth repository set is connected to each computing unit of the Mth row in the N*N matrix;

In the first clock cycle, each computing unit of the first column in the N*N matrix is configured to receive a first group of data input by the repository in the fourth repository set that is connected to it, the first group of data being the first column of data in the addition matrix; in the second clock cycle, each computing unit of the second column in the N*N matrix is configured to receive a second group of data input by the repository in the fourth repository set that is connected to it, the second group of data being the second column of data in the addition matrix; and so on, until in the Nth clock cycle each computing unit of the Nth column in the N*N matrix is configured to receive an Nth group of data input by the repository in the fourth repository set that is connected to it, the Nth group of data being the Nth column of data in the addition matrix;

In the (N+1)th clock cycle, each computing unit in the N*N matrix is further configured to add the received data of the addition matrix to the multiplication result of the first multiplication matrix and the second multiplication matrix, to obtain a multiply-add calculation result of the first multiplication matrix, the second multiplication matrix, and the addition matrix.
The matrix multiplier according to any one of claims 1 to 4, wherein the matrix multiplier further comprises: a scheduler, the scheduler being configured to obtain the first multiplication matrix and the second multiplication matrix in the form of N*N matrices, and to store the first multiplication matrix and the second multiplication matrix in the first repository set and the second repository set, respectively.
A graphics processor, characterized in that the graphics processor comprises a matrix multiplier as claimed in any of claims 1-5.
A system on a chip, characterized in that the system on chip comprises a matrix multiplier as claimed in any of claims 1-5.
A calculation method performed by a matrix multiplier, wherein the matrix multiplier comprises: N*N computing units, the N*N computing units forming an N*N matrix, N being a positive integer greater than or equal to 2; and two repository sets, each repository set comprising N repositories, wherein a first repository set is used to store a first multiplication matrix of the input matrices and a second repository set is used to store a second multiplication matrix of the input matrices, the Mth repository in the first repository set is connected to each computing unit of the Mth row in the N*N matrix, and the Mth repository in the second repository set is connected to each computing unit of the Mth column in the N*N matrix, wherein M is a variable and 1≤M≤N;

The method includes:

In a first clock cycle, each computing unit of each row in the N*N matrix receives first input data broadcast by the repository in the first repository set that is connected to it, and each computing unit of each column in the N*N matrix receives second input data broadcast by the repository in the second repository set that is connected to it; each computing unit in the N*N matrix performs a multiplication calculation on the first input data and the second input data to obtain a first multiplication calculation result, adds the first multiplication calculation result to the initial value in its internal register to obtain a first multiply-add calculation result, and stores the first multiply-add calculation result in its internal register;

In a second clock cycle, each computing unit of each row in the N*N matrix receives first input data broadcast by the repository in the first repository set that is connected to it, and each computing unit of each column in the N*N matrix receives second input data broadcast by the repository in the second repository set that is connected to it; each computing unit in the N*N matrix performs a multiplication calculation on the first input data and the second input data to obtain a second multiplication calculation result, adds the second multiplication calculation result to the first multiply-add calculation result to obtain a second multiply-add calculation result, and stores the second multiply-add calculation result in its internal register;

In subsequent clock cycles, the calculation is performed by analogy until the matrix multiplier completes the multiplication of the first multiplication matrix and the second multiplication matrix after the Nth clock cycle.
The calculation method according to claim 8, wherein, in each clock cycle, all the computing units in the same row of the N*N matrix receive the same first input data, and all the computing units in the same column of the N*N matrix receive the same second input data.
The calculation method according to claim 8 or 9, wherein the matrix multiplier further comprises:

a third repository set, wherein the third repository set is used to store a result matrix, and the Mth repository in the third repository set is connected to each computing unit of the Mth column in the N*N matrix;

The method further includes:

Each computing unit in the N*N matrix outputs the Nth multiply-add calculation result that it has calculated to the repository in the third repository set that is connected to it.
The calculation method according to any one of claims 8 to 10, wherein the matrix multiplier further comprises: a fourth repository set, wherein the fourth repository set is used to store an addition matrix of the input matrices, and the Mth repository in the fourth repository set is connected to each computing unit of the Mth row in the N*N matrix;

The method further includes:

In the first clock cycle, each computing unit of the first column in the N*N matrix receives a first group of data input by the repository in the fourth repository set that is connected to it, the first group of data being the first column of data in the addition matrix; in the second clock cycle, each computing unit of the second column in the N*N matrix receives a second group of data input by the repository in the fourth repository set that is connected to it, the second group of data being the second column of data in the addition matrix; and so on, until in the Nth clock cycle each computing unit of the Nth column in the N*N matrix receives an Nth group of data input by the repository in the fourth repository set that is connected to it, the Nth group of data being the Nth column of data in the addition matrix;

In the (N+1)th clock cycle, each computing unit in the N*N matrix adds the received data of the addition matrix to the Nth multiply-add calculation result of the first multiplication matrix and the second multiplication matrix, to obtain a multiply-add calculation result of the first multiplication matrix, the second multiplication matrix, and the addition matrix.