Disclosure of Invention
The application aims to provide a matrix multiplier, a data processing method, an integrated circuit device and a processor.
The embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a matrix multiplier, including: a local data sharing unit, K vector general purpose registers, and K vector stream processors connected in one-to-one correspondence with the K vector general purpose registers. The local data sharing unit is used for storing a first matrix in row order, where the first matrix is an M x N matrix. The vector general purpose registers are used for storing the respective columns of a second matrix, each vector general purpose register storing one column of the second matrix, where the second matrix is an N x K matrix and K is an integer greater than or equal to 2. The local data sharing unit is connected with each of the K vector stream processors through a bus, so that the elements of the first matrix are loaded one by one, in parallel, to the K vector stream processors and multiplied respectively by the corresponding elements of the columns stored in the K vector general purpose registers; the K vector stream processors accumulate, in parallel and one by one, the multiplication results generated by the elements in the same row of the first matrix with the corresponding elements of the second matrix to obtain all the elements of the same row of a third matrix, thereby completing the multiplication of the first matrix and the second matrix.
In the embodiment of the application, the local data sharing unit is connected with each vector stream processor through the bus; through this path, the elements of the first matrix stored in the local data sharing unit can be loaded directly into the K vector stream processors in parallel, so that the loading operation of moving data from the local data sharing unit → the vector general purpose register → the vector stream processor is omitted, additional read-write operations are reduced, and the occupation of VGPR space is alleviated. Meanwhile, through this path the matrix multiplier can compute all elements of the same row of the third matrix in parallel, greatly reducing the number of times elements are obtained from the first matrix and thereby reducing system overhead.
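The data path just described can be sketched behaviorally in software. Python is used here purely as illustrative pseudo-hardware; names such as `broadcast_matmul` and `vgpr` are assumptions of this sketch, not part of the application:

```python
def broadcast_matmul(A, B):
    """Behavioral sketch of the claimed data path: A (M x N) is held in the
    LDS in row order; each of the K columns of B is held in its own VGPR;
    every element of A is broadcast once over the LDS-Direct bus to all K
    VSPs, which multiply-accumulate in parallel."""
    M, N = len(A), len(A[0])
    K = len(B[0])
    assert len(B) == N, "columns of A must equal rows of B"
    vgpr = [[B[n][k] for n in range(N)] for k in range(K)]  # column k -> VGPR k
    C = []
    for i in range(M):              # one Stage per row of C
        acc = [0.0] * K             # one accumulator per VSP
        for j in range(N):          # one CLK per element of row i
            a = A[i][j]             # single LDS read, broadcast to all VSPs
            for k in range(K):      # the K VSPs operate in parallel
                acc[k] += a * vgpr[k][j]
        C.append(acc)               # all K elements of row i complete together
    return C
```

Note that each element of A appears in exactly one iteration of the inner loops, mirroring the claim that every element of the first matrix is fetched only once.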
With reference to one possible implementation manner of the embodiment of the first aspect, the matrix multiplier further includes a logic change register connected with each vector stream processor, which is used for storing the address for reading each element of the first matrix and for automatically updating to the address corresponding to the next element after the K vector stream processors, in parallel, read the corresponding element of the first matrix from the local data sharing unit according to the current address of the logic change register. In the embodiment of the present application, the logic change register automatically updates to the address corresponding to the next element after the vector stream processors read the corresponding element from the local data sharing unit according to the current address, so the vector stream processors do not need to actively update the address.
With reference to a possible implementation manner of the embodiment of the first aspect, the matrix multiplier further includes a controller connected to each of the vector general purpose registers, and the controller is configured to send a multiplication instruction to the K vector stream processors in parallel to instruct the K vector stream processors to multiply the first matrix with the second matrix. The controller sends multiplication instructions to the K vector stream processors in parallel (simultaneously) to instruct the K vector stream processors to multiply the first matrix and the second matrix, so that the K vector stream processors can synchronously carry out corresponding operations.
With reference to a possible implementation manner of the embodiment of the first aspect, the controller is further connected to the local data sharing unit and each of the vector general purpose registers, and the controller is further configured to store elements in the first matrix to the local data sharing unit according to a row order, and further configured to correspondingly store columns in the second matrix to the K vector general purpose registers according to a column order. In the embodiment of the application, the controller stores the elements in the first matrix into the local data sharing unit according to the row sequence, and correspondingly stores each column in the second matrix into the K vector general registers, so that when the first matrix and the second matrix are subjected to multiplication, calculation can be performed on all elements in the same row of the third matrix, the number of times of obtaining the elements from the first matrix is greatly reduced, and further, the system overhead can be reduced.
In a second aspect, an embodiment of the present application further provides a data processing method applied to a matrix multiplier, where the matrix multiplier includes a local data sharing unit, K vector general purpose registers, and K vector stream processors connected in one-to-one correspondence with the K vector general purpose registers, the local data sharing unit being connected with each of the K vector stream processors through a bus. The method includes: the K vector stream processors acquire, in parallel and one by one in row order, the elements of a first matrix stored in advance in the local data sharing unit; the K vector stream processors acquire, in parallel, the corresponding elements of a second matrix stored in advance in their respective corresponding vector general purpose registers; the K vector stream processors respectively multiply the acquired elements from the first matrix with the corresponding elements from the second matrix; and the K vector stream processors accumulate, in parallel and one by one, the multiplication results generated by the elements in the same row of the first matrix with the corresponding elements of the second matrix to obtain all the elements of the same row of a third matrix.
In combination with a possible implementation manner of the embodiment of the second aspect, the matrix multiplier further includes a logic change register connected with each of the vector stream processors, configured to store the address for reading each element of the first matrix and to automatically update to the address corresponding to the next element after the vector stream processors, in parallel, read the corresponding element of the first matrix from the local data sharing unit according to the current address of the logic change register. In this case, the step of the K vector stream processors acquiring, in parallel and one by one in row order, the elements of the pre-stored first matrix from the local data sharing unit includes: the K vector stream processors acquiring, in parallel and in sequence, the elements of the pre-stored first matrix from the local data sharing unit in row order according to the current address of the logic change register.
With reference to one possible implementation manner of the embodiment of the second aspect, the matrix multiplier further includes: a controller connected to the local data sharing unit, before the K vector stream processors fetch elements of a pre-stored first matrix one by one in row order from the local data sharing unit in parallel, the method further comprising: the controller stores the elements in the first matrix to the local data sharing unit in row order.
With reference to one possible implementation manner of the embodiment of the second aspect, the matrix multiplier further includes: a controller respectively connected to each of the K vector general purpose registers via a bus, the method further comprising, before the K vector stream processors fetch in parallel from the respective corresponding vector general purpose registers the pre-stored corresponding elements from the second matrix: and the controller correspondingly stores each column in the second matrix into the K vector general registers according to the column sequence, and each vector general register stores one column of the second matrix.
With reference to one possible implementation manner of the embodiment of the second aspect, the matrix multiplier further includes a controller connected to each of the vector stream processors; before the K vector stream processors fetch, in parallel, the elements of the pre-stored first matrix one by one in row order from the local data sharing unit, the method further includes: the controller sends multiplication instructions in parallel to the K vector stream processors to instruct the K vector stream processors to multiply the first matrix with the second matrix.
In combination with a possible implementation manner of the embodiment of the second aspect, after the K vector stream processors sequentially accumulate, in parallel, multiplication results generated by elements in a same row of the first matrix and corresponding elements of the second matrix one by one, the method further includes: the K vector stream processors store accumulated results in parallel in row order to a region of the local data sharing unit that does not overlap the first matrix. In the embodiment of the application, the K vector stream processors store the accumulated sum results into the local data sharing unit in parallel according to the row sequence in an area which is not overlapped with the first matrix, so as to reduce the occupation of the memory of the vector stream processors.
In a third aspect, an embodiment of the present application further provides an integrated circuit device, including a substrate and, provided on the substrate, a matrix multiplier as provided in the embodiment of the first aspect and/or any one of the possible implementation manners of the embodiment of the first aspect.
In a fourth aspect, an embodiment of the present application further provides a processor, which includes the integrated circuit device provided in the foregoing third aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used herein solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application merely describes an association relationship between associated objects, indicating that three kinds of relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.
In the first conventional method, both matrix A and matrix B are pre-loaded into a Vector General Purpose Register (VGPR), and when multiplication is performed, a row of matrix A and a column of matrix B are taken for the operation. This scheme needs to pre-load the whole of matrix A and matrix B into the VGPR, wasting a large amount of VGPR space; because VGPR space is limited, the size of the matrices must be limited, and the heavy VGPR usage also degrades system performance.
In the second conventional method, both matrix A and matrix B are preloaded into a Local Data Share unit (LDS); when multiplication is performed, matrix A and matrix B are loaded into the VGPR, and then the multiplication is performed. Although this scheme can save some VGPR space, it needs a large amount of LDS space and adds two extra reads and writes from LDS to VGPR; the extra read-write operations increase power consumption and also reduce performance.
In the third conventional method, matrix A is preloaded to the LDS and matrix B is preloaded to the VGPR; when A × B is performed, matrix A is loaded into the VGPR row by row, and then the multiplication is performed. Although this scheme may save some VGPR space and does not require loading the entire matrix A into the VGPR, there are still many additional read and write operations on matrix A, e.g., writing matrix A to the LDS, reading matrix A from the LDS, writing matrix A to the VGPR, and reading matrix A from the VGPR. These extra read and write operations consume considerable power, so the problem with this solution is that the power consumption is too high.
Through research and analysis, by preloading matrix A to the LDS and matrix B to the VGPR, VGPR resources are saved and all hardware resources are used comprehensively, which addresses the problems of the existing computing modes: the large number of required read-write operations and the occupation of VGPR space. In this scheme, matrix A is broadcast directly to the Vector Stream Processors (VSP) using the LDS_DIRECT path, omitting the loading operation LDS → VGPR → VSP, so that no additional read-write operations are required and the power consumption performance is good. The matrix multiplier and the data processing method thereof according to the embodiments of the present application will be explained below.
Referring to fig. 1, a schematic structural diagram of a matrix multiplier according to an embodiment of the present application is shown, and a structure of the matrix multiplier will be described with reference to fig. 1. The matrix multiplier includes: a Local Data Share unit (LDS), a plurality of Vector General Purpose Registers (VGPR), and a plurality of Vector Stream Processors (VSP) connected in a one-to-one correspondence with the plurality of Vector General Purpose registers. The local data sharing unit LDS may be a Random Access Memory (RAM), a register array, or the like.
The local data sharing unit is configured to store a first matrix (e.g., matrix A) in row order, where the first matrix is an M × N matrix and M, N are integers greater than or equal to 1. When matrix A is loaded into the LDS, it is stored in row order, for example: A1,1, A1,2, …, A1,N−1, A1,N; A2,1, A2,2, …, A2,N−1, A2,N; …; AM,1, AM,2, …, AM,N−1, AM,N.
A plurality of vector general purpose registers (VGPR) are used for storing the respective columns of a second matrix (e.g., matrix B), each vector general purpose register storing one column of the second matrix; i.e., one vector general purpose register stores one column, and different vector general purpose registers store different columns. It should be noted that the number of columns of the second matrix is less than or equal to the number of vector general purpose registers; for example, when the second matrix is an N × K matrix, the number of vector general purpose registers is greater than or equal to K, where K is an integer greater than or equal to 2. When loading matrix B into K VGPRs, one VGPR stores one column and different VGPRs store different columns; e.g., the first VGPR stores the column B1,1, B2,1, …, BN−1,1, BN,1; the second VGPR stores the column B1,2, B2,2, …, BN−1,2, BN,2; the (K−1)-th VGPR stores the column B1,K−1, B2,K−1, …, BN−1,K−1, BN,K−1; and the K-th VGPR stores the column B1,K, B2,K, …, BN−1,K, BN,K.
The vector stream processors are connected with the vector general registers in a one-to-one correspondence mode, namely one vector general register corresponds to one vector stream processor, so that the vector stream processors can conveniently acquire data from the corresponding vector general registers.
The local data sharing unit is connected to each of the plurality of vector stream processors through a bus (LDS-Direct in the figure) so that the elements in the first matrix can be loaded into the plurality of vector stream processors one by one in parallel. Since the second matrix in this embodiment is an N × K matrix, only K vector general purpose registers are needed to store each column in the second matrix, and therefore, in the following description, only K vector general purpose registers and K vector stream processors are used for description (it is understood that the number of the vector general purpose registers and the number of the vector stream processors may be greater than or equal to K). At this time, the local data sharing unit is connected with each of the K vector stream processors through the bus, so that the elements in the first matrix are loaded to the K vector stream processors one by one in parallel, and are multiplied by the elements corresponding to the columns stored in the K vector general registers, and the K vector stream processors accumulate the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one in parallel, that is, each vector stream processor accumulates the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one in sequence, so as to obtain all the elements in the same row of the third matrix, thereby completing the multiplication operation of the first matrix and the second matrix.
For ease of understanding, the matrix multiplication A64×64 * B64×64 = C64×64, that is, M = N = K = 64, is explained below; it is to be understood that 64×64 is merely an example and not a limitation. It should be noted that multiplying two matrices requires that the number of columns (Column) of the first matrix be the same as the number of rows (Row) of the second matrix; the multiplication is only meaningful when this condition holds. For example, the first matrix is an M × N matrix and the second matrix is an N × K matrix. In performing the multiplication, the 64 VSPs read the elements of matrix A (A1,1, A1,2, …, A1,64, A2,1, A2,2, …, A64,64) from the LDS in parallel and obtain the corresponding elements of matrix B from their respective VGPRs in parallel; each of the 64 VSPs multiplies the obtained element from the first matrix with the corresponding element from the second matrix, and each of the 64 VSPs (all 64 executing in parallel) sequentially accumulates, one by one, the multiplication results of the elements in the same row of matrix A with the corresponding elements of the second matrix, yielding all the elements of the same row of matrix C. The calculation process can be represented by Table 1.
TABLE 1
As can be seen from Table 1 above, at time CLK1, A1,1 is loaded in parallel into the 64 VSPs and multiplied by the corresponding elements of the columns stored in each of the 64 VGPRs; at time CLK2, A1,2 is loaded in parallel into the 64 VSPs and multiplied by the corresponding elements of the columns stored in each of the 64 VGPRs. Since A1,1 and A1,2 belong to the same row of the first matrix, each VSP adds the multiplication results corresponding to the elements from that same row of the first matrix; i.e., at time CLK2 it computes C1,1 = C1,1 + A1,2 * B2,1, and the calculation principle at subsequent times is the same. It should be noted that, within the same Stage (taking VSP1 as an example), the C1,1 on the right-hand side denotes the result of the previous time step: e.g., at time CLK2 it denotes the C1,1 of CLK1, at time CLK3 the C1,1 of CLK2, …, and at time CLK64 the C1,1 of CLK63. It can be seen that one Stage is used to calculate all the elements of one row of the third matrix, and each Stage contains 64 CLKs (here, for the example A64×64 * B64×64 = C64×64, one Stage contains 64 CLKs), with each CLK reading one element of matrix A. For example, Stage1 is used to calculate the 1st row of matrix C, Stage2 is used to calculate the 2nd row of matrix C, and so on. Specifically, taking the first row of matrix C as an example:
VSP1: C1,1 = A1,1*B1,1 + A1,2*B2,1 + A1,3*B3,1 + A1,4*B4,1 + … + A1,64*B64,1;
VSP2: C1,2 = A1,1*B1,2 + A1,2*B2,2 + A1,3*B3,2 + A1,4*B4,2 + … + A1,64*B64,2;
VSP3: C1,3 = A1,1*B1,3 + A1,2*B2,3 + A1,3*B3,3 + A1,4*B4,3 + … + A1,64*B64,3;
……
VSP64: C1,64 = A1,1*B1,64 + A1,2*B2,64 + A1,3*B3,64 + A1,4*B4,64 + … + A1,64*B64,64;
it will be readily apparent that each element in A will be loaded in parallel into the various VSPs, such as A described above11、A12、A13And so on, by the elements corresponding to the columns stored in each of the 64 VGPRs, as described above A11*B11、A11*B12、A11*B13、…、A11*B164And then each VSP sequentially accumulates the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one to obtain all the elements in the same row of the third matrix, and the VSP1 sequentially accumulates the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one to obtain C11(ii) a VSP2 accumulates the multiplication results of elements in the same row of the first matrix and corresponding elements of the second matrix one by one to obtain C12(ii) a VSP64 accumulates the multiplication results of elements in the same row of the first matrix and corresponding elements of the second matrix one by one to obtain C164。
As can be seen from the above example, in the embodiment of the present application, when calculating the elements of the third matrix C, all elements of the same row of the third matrix are calculated simultaneously, unlike the prior art, which calculates one element at a time: for example, the prior art completely calculates C1,1 before calculating C1,2, and so on. In addition, in the prior art, when calculating matrix C, both matrix A and matrix B need to be loaded into the VGPR, and the matrix multiplication is performed directly as vector dot products. Taking the calculation of C1,1 as an example: C1,1 = A1,1*B1,1 + A1,2*B2,1 + A1,3*B3,1 + A1,4*B4,1 + … + A1,64*B64,1. Note that this operation of taking two operands from the VGPR is performed for every element of the product matrix C; for example, C1,1, C1,2, C1,3 are calculated sequentially in the above manner, and the calculation order may be row by row, column by column, etc., without limitation.
It can be seen that the calculation method in the present application greatly reduces the number of times elements are obtained from matrix A. For example, when all elements of the first row of matrix C are calculated, each element of the first row of matrix A only needs to be obtained once — 64 elements in total, hence only 64 reads — whereas in the prior art, each time one element of the first row of matrix C is calculated, all elements of the first row of matrix A must be obtained once, so completing all 64 elements of the first row of matrix C requires fetching the first row of matrix A 64 times over, i.e., 64 × 64 reads. The number of reads required for the other rows of matrix C is the same as for the first row, so to complete the multiplication of the first matrix and the second matrix, i.e., A64×64 * B64×64 = C64×64, the total number of reads of elements from matrix A in the embodiment of the present application is 64 × 64, whereas the existing calculation methods require 64 × 64 × 64. The number of element reads is thus greatly reduced, reducing the power consumption of the system and enhancing performance. In addition, in the embodiment of the present application, since the LDS is connected with each VSP through the bus, each VSP can directly acquire all the elements of matrix A from the LDS; through this path, the operation of loading data from LDS → VGPR → VSP is omitted. In the prior art, the elements in the LDS need to be loaded into the VGPR first and then obtained from the VGPR, which adds further read-write operations. It should be noted that the above read counts are for one VSP.
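The read-count comparison above can be checked with a short calculation (the function below is an illustrative sketch only; M, N, K follow the dimensions used in the text):

```python
def reads_of_matrix_a(M, N, K):
    """Reads of the first matrix A needed to form C = A x B.
    Broadcast scheme: every one of the M*N elements of A is read from the
    LDS exactly once. Prior art: each of the K elements in a row of C
    re-reads the whole corresponding row of A (N elements), for all M rows."""
    broadcast = M * N
    prior_art = M * K * N
    return broadcast, prior_art
```

For the 64×64 example this yields 64 × 64 = 4096 reads for the broadcast scheme versus 64 × 64 × 64 = 262144 for the prior art, matching the counts stated above.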
The defects in the prior art described above are results obtained by the inventor after practice and careful study; therefore, the discovery process of the above problems and the solutions proposed below by the embodiments of the present application should be regarded as contributions of the inventor to the present application.
The matrix multiplier shown in the present application can minimize the use of the VGPR, which holds only matrix B, i.e., 64×64 elements, and of the LDS, which holds only matrix A, i.e., 64×64 elements. The present application can also minimize the accesses made by the VSPs: it is only necessary to read matrix A from the LDS to the VSPs, which involves 64×64 accesses, and matrix B from the VGPR to the VSPs, which involves 64×64 accesses per VSP; similarly, the read operations on the VGPR are compressed to a minimum, involving only accesses to matrix B, for a total of 64×64×64 reads.
How to efficiently perform matrix multiplication is crucial to many computer applications, so in this application, for the matrix a and the matrix B performing matrix multiplication, one of the matrices, i.e., a first matrix such as the matrix a, is stored in the local data sharing unit in advance according to the row sequence, and the other matrix, i.e., a second matrix such as the matrix B, is stored in K vector general purpose registers, where each vector general purpose register stores one column of the second matrix, i.e., one vector general purpose register stores one column, and the columns stored by different vector general purpose registers are different. When matrix multiplication is carried out, elements in a first matrix are loaded to K vector stream processors one by one in parallel and are multiplied by elements corresponding to columns stored in K vector general registers respectively, and the K vector stream processors accumulate multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of a second matrix one by one in parallel to obtain all elements in the same row of a third matrix, so that the multiplication operation of the first matrix and the second matrix is completed.
When the elements of the first matrix are stored in the local data sharing unit in row order, each element corresponds to a unique address, so that during multiplication each VSP acquires the element corresponding to an address from the local data sharing unit according to that address. For example: A1,1 → LDS(Address1); A1,2 → LDS(Address2); A1,3 → LDS(Address3); …. It should be noted that the address of each element is different; the addresses of the elements in the same row may increase continuously, as in the above example, or decrease continuously, such as A1,1 → LDS(Address4096); A1,2 → LDS(Address4095); …; A64,64 → LDS(Address1). Furthermore, the addresses may be non-consecutive with a fixed stride, such as 1, 3, 5, 7, …, or non-consecutive with a varying stride, such as 1, 2, 4, 7, 11, 16; the above examples should therefore not be construed as limiting the present application.
Because each element corresponds to an address, after an element is obtained according to the current address, the current address needs to be updated to the address of the next element; for example, after A1,1 is obtained according to the current Address1, the current address needs to be updated to Address2. If the VSPs were to update the address actively, the address would need to be updated once after each element of matrix A is acquired, which is very time-consuming. Therefore, as an embodiment, in order to improve the efficiency of acquiring the elements of matrix A, the matrix multiplier further includes a logic change register (M0 in fig. 1). The logic change register is connected with each vector stream processor and is used for storing the address for reading each element of the first matrix, and for automatically updating to the address of the next element after the K vector stream processors read, in parallel, the corresponding element of the first matrix from the local data sharing unit according to the current address of the logic change register. For example, after the K vector stream processors obtain A1,1 in parallel according to the current address of the logic change register, such as Address1, the address is automatically updated to that of the next element, such as Address2.
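The auto-updating behavior of the logic change register can be modeled roughly as follows (the class name, the dictionary model of the LDS, and the unit step size are illustrative assumptions of this sketch):

```python
class LogicChangeRegister:
    """Toy model of M0: holds the current LDS address of the next element of
    the first matrix and advances automatically after each parallel read,
    so the VSPs never have to update the address themselves."""
    def __init__(self, base_address, step=1):
        self.address = base_address
        self.step = step             # assumed unit stride for row-order storage

    def read_and_advance(self, lds):
        element = lds[self.address]  # one broadcast, consumed by all K VSPs
        self.address += self.step    # automatic update to the next element
        return element
```

For example, with `lds = {1: "A11", 2: "A12"}`, successive calls to `read_and_advance` return the elements in row order while the register steps from Address1 to Address2 on its own.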
To facilitate storing the elements of the first matrix to the local data sharing unit and the respective columns of the second matrix to the vector general purpose registers, the matrix multiplier further comprises a controller, as shown in fig. 2. The controller is respectively connected with the local data sharing unit and each vector general purpose register, and is used for storing the elements of the first matrix to the local data sharing unit in row order, and for correspondingly storing each column of the second matrix to the K vector general purpose registers in column order; the storage format is shown in Table 2.
TABLE 2

VGPR1     VGPR2     ……    VGPR64
B1,1      B1,2      ……    B1,64
B2,1      B2,2      ……    B2,64
……        ……        ……    ……
B64,1     B64,2     ……    B64,64
The controller is also connected to each of the vector stream processors, and is further configured to send multiplication instructions in parallel to the K vector stream processors to instruct the K vector stream processors to multiply the first matrix with the second matrix. Taking A64×64 * B64×64 = C64×64 above as an example, the controller sends a multiplication instruction to the 64 VSPs simultaneously, so that the 64 VSPs obtain the elements of the pre-stored first matrix one by one in row order from the local data sharing unit in parallel, obtain the corresponding elements of the second matrix from their respective vector general purpose registers in parallel, multiply the obtained elements from the first matrix with the corresponding elements from the second matrix, and finally accumulate, one by one in sequence, the multiplication results generated by the elements in the same row of the first matrix with the corresponding elements of the second matrix to obtain all the elements of the same row of the third matrix, thereby completing the multiplication of the first matrix and the second matrix.
Furthermore, in order to reduce the occupation of VSP memory, after the K vector stream processors, in parallel, sequentially accumulate one by one the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix, that is, after all the elements in the same row of the third matrix are obtained, the K vector stream processors store the accumulated results (the total result after the addition) in parallel, in row order, to an area of the LDS that does not overlap with the first matrix. For example, after VSP1 obtains C11, it stores C11 at Address1 in the LDS; after VSP2 obtains C12, it stores C12 at Address2; ……; after VSP64 obtains C164, it stores C164 at Address64, where Address1 to Address64 are all addresses of regions that do not overlap with the regions where the unread elements of the first matrix are located. Of course, the K vector stream processors can instead store the accumulated results in parallel in their corresponding VGPRs, in an area not overlapping with the second matrix: for example, after VSP1 obtains C11, it stores C11 in a region of VGPR1 that does not overlap with column 1 of the second matrix; after VSP2 obtains C12, it stores C12 in a region of VGPR2 that does not overlap with column 2 of the second matrix; ……; after VSP64 obtains C164, it stores C164 in a region of VGPR64 that does not overlap with column 64 of the second matrix.
The matrix multiplier according to the present application can be applied to a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or any circuit device capable of independently performing operations. Those skilled in the art should understand that the scope of the present application is not limited thereto: any change or alternative way of implementing the matrix multiplier that can easily be conceived by a person skilled in the art within the technical scope disclosed herein shall be covered by the scope of the present application.
The embodiment of the present application further provides a data processing method applied to the matrix multiplier, which will be described below with reference to the flowchart shown in fig. 3.
Step S101: the K vector stream processors fetch elements in a pre-stored first matrix one by one in row order from the local data sharing unit in parallel.
Taking the matrix multiplication A64x64 * B64x64 = C64x64 as an example, the 64 VSPs acquire the elements of the pre-stored first matrix one by one in row order from the LDS in parallel (A11, A12, …, A164, A21, A22, …, A6464). Since each element of the first matrix corresponds to a unique address when stored in the LDS, each VSP can obtain the element corresponding to an address from the local data sharing unit according to that address. Because each element corresponds to an address, after an element is obtained according to the current address, the current address needs to be updated to the address corresponding to the next element; for example, after A11 is obtained according to the current Address1, the current address needs to be updated to Address2, and if the VSPs were used to actively update the address, the address would have to be updated once after each element of matrix A is acquired, which is very time-consuming. The logic change register is connected with each vector stream processor and is used for storing the read address of each element in the first matrix; after the K vector stream processors read the corresponding element of the first matrix from the local data sharing unit in parallel according to the current address of the logic change register, the logic change register automatically updates to the address corresponding to the next element. Correspondingly, the K vector stream processors acquire the elements of the pre-stored first matrix one by one in row order from the local data sharing unit in parallel; specifically, they do so according to the current address of the logic change register.
Wherein the first matrix needs to be stored in the LDS in advance, therefore, before step S101, the method further includes: storing elements in the first matrix to the local data sharing unit. In one embodiment, the matrix multiplier further comprises a controller connected to the local data sharing unit. At this time, the elements in the first matrix may be stored to the local data sharing unit in a row order by a controller.
Furthermore, as an embodiment, each vector stream processor may perform the subsequent processing, such as obtaining the pre-stored elements of the second matrix from its corresponding vector general purpose register in parallel, after receiving a multiplication instruction for multiplying the first matrix by the second matrix, for example after each of the K vector stream processors receives a multiplication instruction sent by the controller. The controller is connected to each of the vector stream processors and sends multiplication instructions to the K vector stream processors in parallel (simultaneously) to instruct them to multiply the first matrix by the second matrix. That is, before step S101, the method further includes: the controller sends multiplication instructions to the K vector stream processors in parallel to instruct the K vector stream processors to multiply the first matrix by the second matrix. Of course, the multiplication of the first matrix and the second matrix may also be triggered by other means, such as by a timer.
Step S102: the K vector stream processors fetch the pre-stored corresponding elements from the second matrix in parallel from the respective corresponding vector general purpose registers.
Taking the matrix multiplication A64x64 * B64x64 = C64x64 as an example, when the 64 VSPs have obtained element A11 from the LDS in parallel, the 64 VSPs then obtain the corresponding elements of the pre-stored second matrix from their respective vector general purpose registers in parallel, i.e. the first row of elements in Table 2: VSP1 obtains B11 from the directly connected VGPR1, VSP2 obtains B12 from the directly connected VGPR2, VSP3 obtains B13 from the directly connected VGPR3, ……, and VSP64 obtains B164 from the directly connected VGPR64.
The second matrix needs to be stored in the K VGPRs in advance; therefore, before step S102, the method further includes: correspondingly storing each column of the second matrix into the K vector general registers, where each vector general register stores one column of the second matrix and different vector general registers store different columns. In one embodiment, the matrix multiplier further comprises a controller coupled to each of the K vector general registers via a bus. In this case, the controller may be used to correspondingly store each column of the second matrix into the K vector general registers. The storage is as shown in Table 2 above.
Step S103: the K vector stream processors each multiply the acquired elements from the first matrix with corresponding elements from the second matrix.
For example, for VSP1, the element A11 obtained from the first matrix is multiplied by the corresponding element B11 from the second matrix, the element A12 from the first matrix is multiplied by the corresponding element B21 from the second matrix, ……, and the element A164 from the first matrix is multiplied by the corresponding element B641 from the second matrix.
Step S104: and the K vector stream processors accumulate multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one in parallel in sequence to obtain all elements in the same row of a third matrix.
For example, for VSP1, the multiplication results of the elements in the same row of the first matrix and the corresponding elements of the second matrix are sequentially accumulated, i.e. the multiplication results of the elements in the first row of the first matrix and the corresponding elements of the second matrix are sequentially accumulated to obtain C11, i.e. VSP1: C11 = A11*B11 + A12*B21 + A13*B31 + A14*B41 + … + A164*B641;
Similarly, for VSP2, the multiplication results of the elements in row 1 of the first matrix and the corresponding elements of the second matrix are sequentially accumulated to obtain C12, i.e. VSP2: C12 = A11*B12 + A12*B22 + A13*B32 + A14*B42 + … + A164*B642. Since the K VSPs process in parallel, this yields all elements of the same row of the third matrix, e.g. all elements of the first row of the third matrix.
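A quick numeric check on a 2x2 illustrative case shows the two lane formulas above agree with an ordinary matrix product (the values here are made up for illustration): VSP1's accumulation forms C11 and VSP2's forms C12, together giving the first row of C.

```python
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# VSP1's accumulation: row 1 of A against column 1 of B (stored in VGPR1)
C11 = A[0][0] * B[0][0] + A[0][1] * B[1][0]
# VSP2's accumulation: row 1 of A against column 2 of B (stored in VGPR2)
C12 = A[0][0] * B[0][1] + A[0][1] * B[1][1]

assert [C11, C12] == [19, 22]   # first row of A times B
```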
In addition, in order to reduce the occupation of VSP memory, after the K vector stream processors, in parallel, sequentially accumulate one by one the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix, that is, after all the elements in the same row of the third matrix are obtained, the method further includes: the K vector stream processors store the accumulated results in parallel, in row order, to areas of the LDS that do not overlap with the first matrix. For example, after VSP1 obtains C11, it stores C11 at Address1 in the LDS; after VSP2 obtains C12, it stores C12 at Address2; ……; after VSP64 obtains C164, it stores C164 at Address64, where Address1 to Address64 are all addresses of regions that do not overlap with the regions where the unread elements of the first matrix are located. Of course, the K vector stream processors may instead store the accumulated results in parallel in their corresponding VGPRs, in an area not overlapping with the second matrix.
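The write-back step can be sketched as follows. The layout is assumed for illustration (a flat list standing in for the LDS, with A occupying the first slots): the target offset is simply chosen past the region still holding unread elements of A, so storing a finished row of C cannot clobber inputs.

```python
def store_row_to_lds(lds, c_row, base_offset):
    """Write one finished row of C into the LDS at base_offset, which the
    caller chooses so the written range does not overlap unread A elements."""
    for k, value in enumerate(c_row):
        lds[base_offset + k] = value
    return lds

lds = [0] * 8             # toy LDS; the first 4 slots hold a 2x2 matrix A
lds[:4] = [1, 2, 3, 4]
store_row_to_lds(lds, [19, 22], base_offset=4)   # row of C lands after A
assert lds == [1, 2, 3, 4, 19, 22, 0, 0]
```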
The data processing method provided by the embodiment of the present application has the same implementation principle and technical effect as the matrix multiplier, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing device embodiments for the parts that are not mentioned in the method embodiments.
Embodiments of the present application also provide an integrated circuit device that includes a substrate and a matrix multiplier disposed on the substrate. The substrate may be a circuit substrate commonly used at present, such as a PCB. It should be noted that, since the local data sharing unit (LDS) can implement data sharing, two or more matrix multipliers can share one LDS. For example, when it is necessary to calculate both matrix A x matrix B and matrix A x matrix C, two matrix multipliers can share one LDS: the elements of matrix A are stored in the LDS in row order, and when performing the matrix calculations, the elements stored in the LDS are loaded one by one, in parallel, both to the K vector stream processors in the first matrix multiplier and to the K vector stream processors in the second matrix multiplier. Accordingly, the integrated circuit device may also omit the LDS from the matrix multiplier, i.e. the LDS is not integrated in the integrated circuit device but exists separately.
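The shared-LDS arrangement can be illustrated with a small model (hypothetical names, list-based memory): matrix A is stored once, and the same row-major copy feeds two multiplier units computing A x B and A x C.

```python
def shared_lds_multiply(A, Bs):
    """Compute A times B for every B in Bs from a single row-major copy of A,
    modeling several matrix multipliers sharing one LDS."""
    results = []
    for B in Bs:                          # one matrix multiplier per B
        K = len(B[0])
        C = []
        for row in A:                     # the same stored rows feed each unit
            acc = [0] * K
            for i, a in enumerate(row):
                for k in range(K):
                    acc[k] += a * B[i][k]
            C.append(acc)
        results.append(C)
    return results

A = [[1, 0], [0, 1]]                      # identity, so A times X equals X
B = [[1, 2], [3, 4]]
C = [[5, 6], [7, 8]]
AB, AC = shared_lds_multiply(A, [B, C])
assert AB == B and AC == C
```

The point of the sharing is that A is stored (and occupies LDS space) only once, however many products reuse it.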
The embodiment of the present application further provides a processor including at least the integrated circuit device. The processor may be a general-purpose processor, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a microprocessor; it may also be an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.