Disclosure of Invention
The application aims to provide a matrix multiplier, a data processing method, an integrated circuit device and a processor.
The embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a matrix multiplier, including: a local data sharing unit, K vector general purpose registers, and K vector stream processors connected in one-to-one correspondence with the K vector general purpose registers. The local data sharing unit is used for storing a first matrix in row order, where the first matrix is an M x N matrix. The vector general purpose registers are used for storing the respective columns of a second matrix, each vector general purpose register storing one column of the second matrix, where the second matrix is an N x K matrix and K is an integer greater than or equal to 2. The local data sharing unit is connected with each of the K vector stream processors through a bus, so that the elements of the first matrix are loaded one by one, in parallel, to the K vector stream processors and multiplied respectively by the corresponding elements of the columns stored in the K vector general purpose registers; the K vector stream processors accumulate, in parallel and one by one, the multiplication results generated by the elements in the same row of the first matrix with the corresponding elements of the second matrix to obtain all the elements of the same row of a third matrix, thereby completing the multiplication of the first matrix and the second matrix.
In the embodiment of the application, the local data sharing unit is connected with each vector stream processor through the bus; through this path, the elements of the first matrix stored in the local data sharing unit can be loaded directly into the K vector stream processors in parallel, so that the loading operation of moving data from the local data sharing unit → the vector general purpose register → the vector stream processor is omitted, additional read-write operations are reduced, and the occupation of VGPR space is alleviated. Meanwhile, through this path the matrix multiplier can compute all elements of the same row of the third matrix in parallel, greatly reducing the number of times elements are obtained from the first matrix and thereby reducing system overhead.
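The data path just described can be sketched behaviorally in software. Python is used here purely as illustrative pseudo-hardware; names such as `broadcast_matmul` and `vgpr` are assumptions of this sketch, not part of the application:

```python
def broadcast_matmul(A, B):
    """Behavioral sketch of the claimed data path: A (M x N) is held in the
    LDS in row order; each of the K columns of B is held in its own VGPR;
    every element of A is broadcast once over the LDS-Direct bus to all K
    VSPs, which multiply-accumulate in parallel."""
    M, N = len(A), len(A[0])
    K = len(B[0])
    assert len(B) == N, "columns of A must equal rows of B"
    vgpr = [[B[n][k] for n in range(N)] for k in range(K)]  # column k -> VGPR k
    C = []
    for i in range(M):              # one Stage per row of C
        acc = [0.0] * K             # one accumulator per VSP
        for j in range(N):          # one CLK per element of row i
            a = A[i][j]             # single LDS read, broadcast to all VSPs
            for k in range(K):      # the K VSPs operate in parallel
                acc[k] += a * vgpr[k][j]
        C.append(acc)               # all K elements of row i complete together
    return C
```

Note that each element of A appears in exactly one iteration of the inner loops, mirroring the claim that every element of the first matrix is fetched only once.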
With reference to one possible implementation manner of the embodiment of the first aspect, the matrix multiplier further includes a logic change register connected with each vector stream processor, which is used for storing the address for reading each element of the first matrix and for automatically updating to the address corresponding to the next element after the K vector stream processors, in parallel, read the corresponding element of the first matrix from the local data sharing unit according to the current address of the logic change register. In the embodiment of the present application, the logic change register automatically updates to the address corresponding to the next element after the vector stream processors read the corresponding element from the local data sharing unit according to the current address, so the vector stream processors do not need to actively update the address.
With reference to a possible implementation manner of the embodiment of the first aspect, the matrix multiplier further includes a controller connected to each of the vector general purpose registers, and the controller is configured to send a multiplication instruction to the K vector stream processors in parallel to instruct the K vector stream processors to multiply the first matrix with the second matrix. The controller sends multiplication instructions to the K vector stream processors in parallel (simultaneously) to instruct the K vector stream processors to multiply the first matrix and the second matrix, so that the K vector stream processors can synchronously carry out corresponding operations.
With reference to a possible implementation manner of the embodiment of the first aspect, the controller is further connected to the local data sharing unit and each of the vector general purpose registers, and the controller is further configured to store elements in the first matrix to the local data sharing unit according to a row order, and further configured to correspondingly store columns in the second matrix to the K vector general purpose registers according to a column order. In the embodiment of the application, the controller stores the elements in the first matrix into the local data sharing unit according to the row sequence, and correspondingly stores each column in the second matrix into the K vector general registers, so that when the first matrix and the second matrix are subjected to multiplication, calculation can be performed on all elements in the same row of the third matrix, the number of times of obtaining the elements from the first matrix is greatly reduced, and further, the system overhead can be reduced.
In a second aspect, an embodiment of the present application further provides a data processing method applied to a matrix multiplier, where the matrix multiplier includes a local data sharing unit, K vector general purpose registers, and K vector stream processors connected in one-to-one correspondence with the K vector general purpose registers, the local data sharing unit being connected with each of the K vector stream processors through a bus. The method includes: the K vector stream processors acquire, in parallel and one by one in row order, the elements of a first matrix stored in advance in the local data sharing unit; the K vector stream processors acquire, in parallel, the corresponding elements of a second matrix stored in advance in their respective corresponding vector general purpose registers; the K vector stream processors respectively multiply the acquired elements from the first matrix with the corresponding elements from the second matrix; and the K vector stream processors accumulate, in parallel and one by one, the multiplication results generated by the elements in the same row of the first matrix with the corresponding elements of the second matrix to obtain all the elements of the same row of a third matrix.
In combination with a possible implementation manner of the embodiment of the second aspect, the matrix multiplier further includes a logic change register connected with each of the vector stream processors, configured to store the address for reading each element of the first matrix and to automatically update to the address corresponding to the next element after the vector stream processors, in parallel, read the corresponding element of the first matrix from the local data sharing unit according to the current address of the logic change register. In this case, the step of the K vector stream processors acquiring, in parallel and one by one in row order, the elements of the pre-stored first matrix from the local data sharing unit includes: the K vector stream processors acquiring, in parallel and in sequence, the elements of the pre-stored first matrix from the local data sharing unit in row order according to the current address of the logic change register.
With reference to one possible implementation manner of the embodiment of the second aspect, the matrix multiplier further includes: a controller connected to the local data sharing unit, before the K vector stream processors fetch elements of a pre-stored first matrix one by one in row order from the local data sharing unit in parallel, the method further comprising: the controller stores the elements in the first matrix to the local data sharing unit in row order.
With reference to one possible implementation manner of the embodiment of the second aspect, the matrix multiplier further includes: a controller respectively connected to each of the K vector general purpose registers via a bus, the method further comprising, before the K vector stream processors fetch in parallel from the respective corresponding vector general purpose registers the pre-stored corresponding elements from the second matrix: and the controller correspondingly stores each column in the second matrix into the K vector general registers according to the column sequence, and each vector general register stores one column of the second matrix.
With reference to one possible implementation manner of the embodiment of the second aspect, the matrix multiplier further includes a controller connected to each of the vector stream processors; before the K vector stream processors fetch, in parallel, the elements of the pre-stored first matrix one by one in row order from the local data sharing unit, the method further includes: the controller sends multiplication instructions in parallel to the K vector stream processors to instruct the K vector stream processors to multiply the first matrix with the second matrix.
In combination with a possible implementation manner of the embodiment of the second aspect, after the K vector stream processors sequentially accumulate, in parallel, multiplication results generated by elements in a same row of the first matrix and corresponding elements of the second matrix one by one, the method further includes: the K vector stream processors store accumulated results in parallel in row order to a region of the local data sharing unit that does not overlap the first matrix. In the embodiment of the application, the K vector stream processors store the accumulated sum results into the local data sharing unit in parallel according to the row sequence in an area which is not overlapped with the first matrix, so as to reduce the occupation of the memory of the vector stream processors.
In a third aspect, an embodiment of the present application further provides an integrated circuit device, including a substrate and, provided on the substrate, a matrix multiplier as provided in the embodiment of the first aspect and/or any one of the possible implementation manners of the embodiment of the first aspect.
In a fourth aspect, an embodiment of the present application further provides a processor, which includes the integrated circuit device provided in the foregoing third aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used herein solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application merely describes an association relationship between associated objects, indicating that three kinds of relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.
In the first conventional method, both matrix A and matrix B are pre-loaded into a Vector General Purpose Register (VGPR), and when multiplication is performed, a row of matrix A and a column of matrix B are taken for the operation. This scheme needs to pre-load the whole of matrix A and matrix B into the VGPR, wasting a large amount of VGPR space; because VGPR space is limited, the size of the matrices must be limited, and the heavy VGPR usage also degrades system performance.
In the second conventional method, both matrix A and matrix B are preloaded into a Local Data Share unit (LDS); when multiplication is performed, matrix A and matrix B are loaded into the VGPR, and then the multiplication is performed. Although this scheme can save some VGPR space, it needs a large amount of LDS space and adds two extra reads and writes from LDS to VGPR; the extra read-write operations increase power consumption and also reduce performance.
In the third conventional method, matrix A is preloaded to the LDS and matrix B is preloaded to the VGPR; when A × B is performed, matrix A is loaded into the VGPR row by row, and then the multiplication is performed. Although this scheme may save some VGPR space and does not require loading the entire matrix A into the VGPR, there are still many additional read and write operations on matrix A, e.g., writing matrix A to the LDS, reading matrix A from the LDS, writing matrix A to the VGPR, and reading matrix A from the VGPR. These extra read and write operations consume considerable power, so the problem with this solution is that the power consumption is too high.
Through research and analysis, by preloading matrix A to the LDS and matrix B to the VGPR, VGPR resources are saved and all hardware resources are used comprehensively, which addresses the problems of the existing computing modes: the large number of required read-write operations and the occupation of VGPR space. In this scheme, matrix A is broadcast directly to the Vector Stream Processors (VSP) using the LDS_DIRECT path, omitting the loading operation LDS → VGPR → VSP, so that no additional read-write operations are required and the power consumption performance is good. The matrix multiplier and the data processing method thereof according to the embodiments of the present application will be explained below.
Referring to fig. 1, a schematic structural diagram of a matrix multiplier according to an embodiment of the present application is shown, and a structure of the matrix multiplier will be described with reference to fig. 1. The matrix multiplier includes: a Local Data Share unit (LDS), a plurality of Vector General Purpose Registers (VGPR), and a plurality of Vector Stream Processors (VSP) connected in a one-to-one correspondence with the plurality of Vector General Purpose registers. The local data sharing unit LDS may be a Random Access Memory (RAM), a register array, or the like.
The local data sharing unit is configured to store a first matrix (e.g., matrix A) in row order, where the first matrix is an M × N matrix and M, N are integers greater than or equal to 1. When matrix A is loaded into the LDS, it is stored in row order, for example: A1,1, A1,2, …, A1,N−1, A1,N; A2,1, A2,2, …, A2,N−1, A2,N; …; AM,1, AM,2, …, AM,N−1, AM,N.
A plurality of vector general purpose registers (VGPR) are used for storing the respective columns of a second matrix (e.g., matrix B), each vector general purpose register storing one column of the second matrix; i.e., one vector general purpose register stores one column, and different vector general purpose registers store different columns. It should be noted that the number of columns of the second matrix is less than or equal to the number of vector general purpose registers; for example, when the second matrix is an N × K matrix, the number of vector general purpose registers is greater than or equal to K, where K is an integer greater than or equal to 2. When loading matrix B into K VGPRs, one VGPR stores one column and different VGPRs store different columns; e.g., the first VGPR stores the column B1,1, B2,1, …, BN−1,1, BN,1; the second VGPR stores the column B1,2, B2,2, …, BN−1,2, BN,2; the (K−1)-th VGPR stores the column B1,K−1, B2,K−1, …, BN−1,K−1, BN,K−1; and the K-th VGPR stores the column B1,K, B2,K, …, BN−1,K, BN,K.
The vector stream processors are connected with the vector general registers in a one-to-one correspondence mode, namely one vector general register corresponds to one vector stream processor, so that the vector stream processors can conveniently acquire data from the corresponding vector general registers.
The local data sharing unit is connected to each of the plurality of vector stream processors through a bus (LDS-Direct in the figure) so that the elements in the first matrix can be loaded into the plurality of vector stream processors one by one in parallel. Since the second matrix in this embodiment is an N × K matrix, only K vector general purpose registers are needed to store each column in the second matrix, and therefore, in the following description, only K vector general purpose registers and K vector stream processors are used for description (it is understood that the number of the vector general purpose registers and the number of the vector stream processors may be greater than or equal to K). At this time, the local data sharing unit is connected with each of the K vector stream processors through the bus, so that the elements in the first matrix are loaded to the K vector stream processors one by one in parallel, and are multiplied by the elements corresponding to the columns stored in the K vector general registers, and the K vector stream processors accumulate the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one in parallel, that is, each vector stream processor accumulates the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one in sequence, so as to obtain all the elements in the same row of the third matrix, thereby completing the multiplication operation of the first matrix and the second matrix.
For ease of understanding, the matrix multiplication A64×64 * B64×64 = C64×64, that is, M = N = K = 64, is explained below; it is to be understood that 64×64 is merely an example and not a limitation. It should be noted that multiplying two matrices requires that the number of columns (Column) of the first matrix be the same as the number of rows (Row) of the second matrix; the multiplication is only meaningful when this condition holds. For example, the first matrix is an M × N matrix and the second matrix is an N × K matrix. In performing the multiplication, the 64 VSPs read the elements of matrix A (A1,1, A1,2, …, A1,64, A2,1, A2,2, …, A64,64) from the LDS in parallel and obtain the corresponding elements of matrix B from their respective VGPRs in parallel; each of the 64 VSPs multiplies the obtained element from the first matrix with the corresponding element from the second matrix, and each of the 64 VSPs (all 64 executing in parallel) sequentially accumulates, one by one, the multiplication results of the elements in the same row of matrix A with the corresponding elements of the second matrix, yielding all the elements of the same row of matrix C. The calculation process can be represented by Table 1.
TABLE 1
As can be seen from Table 1 above, at time CLK1, A1,1 is loaded in parallel into the 64 VSPs and multiplied by the corresponding elements of the columns stored in each of the 64 VGPRs; at time CLK2, A1,2 is loaded in parallel into the 64 VSPs and multiplied by the corresponding elements of the columns stored in each of the 64 VGPRs. Since A1,1 and A1,2 belong to the same row of the first matrix, each VSP adds the multiplication results corresponding to the elements from that same row of the first matrix; i.e., at time CLK2 it computes C1,1 = C1,1 + A1,2 * B2,1, and the calculation principle at subsequent times is the same. It should be noted that, within the same Stage (taking VSP1 as an example), the C1,1 on the right-hand side denotes the result of the previous time step: e.g., at time CLK2 it denotes the C1,1 of CLK1, at time CLK3 the C1,1 of CLK2, …, and at time CLK64 the C1,1 of CLK63. It can be seen that one Stage is used to calculate all the elements of one row of the third matrix, and each Stage contains 64 CLKs (here, for the example A64×64 * B64×64 = C64×64, one Stage contains 64 CLKs), with each CLK reading one element of matrix A. For example, Stage1 is used to calculate the 1st row of matrix C, Stage2 is used to calculate the 2nd row of matrix C, and so on. Specifically, taking the first row of matrix C as an example:
VSP1: C1,1 = A1,1*B1,1 + A1,2*B2,1 + A1,3*B3,1 + A1,4*B4,1 + … + A1,64*B64,1;
VSP2: C1,2 = A1,1*B1,2 + A1,2*B2,2 + A1,3*B3,2 + A1,4*B4,2 + … + A1,64*B64,2;
VSP3: C1,3 = A1,1*B1,3 + A1,2*B2,3 + A1,3*B3,3 + A1,4*B4,3 + … + A1,64*B64,3;
……
VSP64: C1,64 = A1,1*B1,64 + A1,2*B2,64 + A1,3*B3,64 + A1,4*B4,64 + … + A1,64*B64,64;
it will be readily apparent that each element in A will be loaded in parallel into the various VSPs, such as A described above11、A12、A13And so on, by the elements corresponding to the columns stored in each of the 64 VGPRs, as described above A11*B11、A11*B12、A11*B13、…、A11*B164And then each VSP sequentially accumulates the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one to obtain all the elements in the same row of the third matrix, and the VSP1 sequentially accumulates the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one to obtain C11(ii) a VSP2 accumulates the multiplication results of elements in the same row of the first matrix and corresponding elements of the second matrix one by one to obtain C12(ii) a VSP64 accumulates the multiplication results of elements in the same row of the first matrix and corresponding elements of the second matrix one by one to obtain C164。
As can be seen from the above example, in the embodiment of the present application, when calculating the elements of the third matrix C, all elements of the same row of the third matrix are calculated simultaneously, unlike the prior art, which calculates one element at a time: for example, the prior art completely calculates C1,1 before calculating C1,2, and so on. In addition, in the prior art, when calculating matrix C, both matrix A and matrix B need to be loaded into the VGPR, and the matrix multiplication is performed directly as vector dot products. Taking the calculation of C1,1 as an example: C1,1 = A1,1*B1,1 + A1,2*B2,1 + A1,3*B3,1 + A1,4*B4,1 + … + A1,64*B64,1. Note that this operation of taking two operands from the VGPR is performed for every element of the product matrix C; for example, C1,1, C1,2, C1,3 are calculated sequentially in the above manner, and the calculation order may be row by row, column by column, etc., without limitation.
It can be seen that the calculation method in the present application greatly reduces the number of times elements are obtained from matrix A. For example, when all elements of the first row of matrix C are calculated, each element of the first row of matrix A only needs to be obtained once — 64 elements in total, hence only 64 reads — whereas in the prior art, each time one element of the first row of matrix C is calculated, all elements of the first row of matrix A must be obtained once, so completing all 64 elements of the first row of matrix C requires fetching the first row of matrix A 64 times over, i.e., 64 × 64 reads. The number of reads required for the other rows of matrix C is the same as for the first row, so to complete the multiplication of the first matrix and the second matrix, i.e., A64×64 * B64×64 = C64×64, the total number of reads of elements from matrix A in the embodiment of the present application is 64 × 64, whereas the existing calculation methods require 64 × 64 × 64. The number of element reads is thus greatly reduced, reducing the power consumption of the system and enhancing performance. In addition, in the embodiment of the present application, since the LDS is connected with each VSP through the bus, each VSP can directly acquire all the elements of matrix A from the LDS; through this path, the operation of loading data from LDS → VGPR → VSP is omitted. In the prior art, the elements in the LDS need to be loaded into the VGPR first and then obtained from the VGPR, which adds further read-write operations. It should be noted that the above read counts are for one VSP.
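The read-count comparison above can be checked with a short calculation (the function below is an illustrative sketch only; M, N, K follow the dimensions used in the text):

```python
def reads_of_matrix_a(M, N, K):
    """Reads of the first matrix A needed to form C = A x B.
    Broadcast scheme: every one of the M*N elements of A is read from the
    LDS exactly once. Prior art: each of the K elements in a row of C
    re-reads the whole corresponding row of A (N elements), for all M rows."""
    broadcast = M * N
    prior_art = M * K * N
    return broadcast, prior_art
```

For the 64×64 example this yields 64 × 64 = 4096 reads for the broadcast scheme versus 64 × 64 × 64 = 262144 for the prior art, matching the counts stated above.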
The defects in the prior art described above are results obtained by the inventor after practice and careful study; therefore, the discovery process of the above problems and the solutions proposed below by the embodiments of the present application should be regarded as contributions of the inventor to the present application.
The matrix multiplier shown in the present application can minimize the use of the VGPR, which holds only matrix B, i.e., 64×64 elements, and of the LDS, which holds only matrix A, i.e., 64×64 elements. The present application can also minimize the accesses made by the VSPs: it is only necessary to read matrix A from the LDS to the VSPs, which involves 64×64 accesses, and matrix B from the VGPR to the VSPs, which involves 64×64 accesses per VSP; similarly, the read operations on the VGPR are compressed to a minimum, involving only accesses to matrix B, for a total of 64×64×64 reads.
How to efficiently perform matrix multiplication is crucial to many computer applications, so in this application, for the matrix a and the matrix B performing matrix multiplication, one of the matrices, i.e., a first matrix such as the matrix a, is stored in the local data sharing unit in advance according to the row sequence, and the other matrix, i.e., a second matrix such as the matrix B, is stored in K vector general purpose registers, where each vector general purpose register stores one column of the second matrix, i.e., one vector general purpose register stores one column, and the columns stored by different vector general purpose registers are different. When matrix multiplication is carried out, elements in a first matrix are loaded to K vector stream processors one by one in parallel and are multiplied by elements corresponding to columns stored in K vector general registers respectively, and the K vector stream processors accumulate multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of a second matrix one by one in parallel to obtain all elements in the same row of a third matrix, so that the multiplication operation of the first matrix and the second matrix is completed.
When the elements of the first matrix are stored in the local data sharing unit in row order, each element corresponds to a unique address, so that during multiplication each VSP acquires the element corresponding to an address from the local data sharing unit according to that address. For example: A1,1 → LDS(Address1); A1,2 → LDS(Address2); A1,3 → LDS(Address3); …. It should be noted that the address of each element is different; the addresses of the elements in the same row may increase continuously, as in the above example, or decrease continuously, such as A1,1 → LDS(Address4096); A1,2 → LDS(Address4095); …; A64,64 → LDS(Address1). Furthermore, the addresses may be non-consecutive with a fixed stride, such as 1, 3, 5, 7, …, or non-consecutive with a varying stride, such as 1, 2, 4, 7, 11, 16; the above examples should therefore not be construed as limiting the present application.
Because each element corresponds to an address, after an element is obtained according to the current address, the current address needs to be updated to the address of the next element; for example, after A1,1 is obtained according to the current Address1, the current address needs to be updated to Address2. If the VSPs were to update the address actively, the address would need to be updated once after each element of matrix A is acquired, which is very time-consuming. Therefore, as an embodiment, in order to improve the efficiency of acquiring the elements of matrix A, the matrix multiplier further includes a logic change register (M0 in fig. 1). The logic change register is connected with each vector stream processor and is used for storing the address for reading each element of the first matrix, and for automatically updating to the address of the next element after the K vector stream processors read, in parallel, the corresponding element of the first matrix from the local data sharing unit according to the current address of the logic change register. For example, after the K vector stream processors obtain A1,1 in parallel according to the current address of the logic change register, such as Address1, the address is automatically updated to that of the next element, such as Address2.
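The auto-updating behavior of the logic change register can be modeled roughly as follows (the class name, the dictionary model of the LDS, and the unit step size are illustrative assumptions of this sketch):

```python
class LogicChangeRegister:
    """Toy model of M0: holds the current LDS address of the next element of
    the first matrix and advances automatically after each parallel read,
    so the VSPs never have to update the address themselves."""
    def __init__(self, base_address, step=1):
        self.address = base_address
        self.step = step             # assumed unit stride for row-order storage

    def read_and_advance(self, lds):
        element = lds[self.address]  # one broadcast, consumed by all K VSPs
        self.address += self.step    # automatic update to the next element
        return element
```

For example, with `lds = {1: "A11", 2: "A12"}`, successive calls to `read_and_advance` return the elements in row order while the register steps from Address1 to Address2 on its own.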
To facilitate storing the elements of the first matrix to the local data sharing unit and the respective columns of the second matrix to the vector general purpose registers, the matrix multiplier further comprises a controller, as shown in fig. 2. The controller is respectively connected with the local data sharing unit and each vector general purpose register, and is used for storing the elements of the first matrix to the local data sharing unit in row order, and for correspondingly storing each column of the second matrix to the K vector general purpose registers in column order; the storage format is shown in Table 2.
TABLE 2

VGPR1     VGPR2     ……    VGPR64
B1,1      B1,2      ……    B1,64
B2,1      B2,2      ……    B2,64
……        ……        ……    ……
B64,1     B64,2     ……    B64,64
The controller is also connected to each of the vector stream processors, and is further configured to send multiplication instructions in parallel to the K vector stream processors to instruct the K vector stream processors to multiply the first matrix with the second matrix. Taking A64×64 * B64×64 = C64×64 above as an example, the controller sends a multiplication instruction to the 64 VSPs simultaneously, so that the 64 VSPs obtain the elements of the pre-stored first matrix one by one in row order from the local data sharing unit in parallel, obtain the corresponding elements of the second matrix from their respective vector general purpose registers in parallel, multiply the obtained elements from the first matrix with the corresponding elements from the second matrix, and finally accumulate, one by one in sequence, the multiplication results generated by the elements in the same row of the first matrix with the corresponding elements of the second matrix to obtain all the elements of the same row of the third matrix, thereby completing the multiplication of the first matrix and the second matrix.
Furthermore, in order to reduce the occupation of VSP memory, after the K vector stream processors, in parallel, sequentially accumulate one by one the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix, that is, after all the elements in the same row of the third matrix are obtained, the K vector stream processors store the accumulated results (the total result after the addition) in parallel, in row order, to an area of the LDS that does not overlap with the first matrix. For example, after VSP1 obtains C11, it stores C11 at Address1 in the LDS; after VSP2 obtains C12, it stores C12 at Address2; ……; after VSP64 obtains C164, it stores C164 at Address64, where Address1 to Address64 are all addresses of regions that do not overlap with the regions where the unread elements of the first matrix are located. Of course, the K vector stream processors can instead store the accumulated results in parallel in their corresponding VGPRs, in an area not overlapping with the second matrix: for example, after VSP1 obtains C11, it stores C11 in a region of VGPR1 that does not overlap with column 1 of the second matrix; after VSP2 obtains C12, it stores C12 in a region of VGPR2 that does not overlap with column 2 of the second matrix; ……; after VSP64 obtains C164, it stores C164 in a region of VGPR64 that does not overlap with column 64 of the second matrix.
The matrix multiplier according to the present application can be applied to a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or any circuit device capable of independently performing operations. Those skilled in the art should understand that the scope of the present application is not limited thereto: any change or alternative way of implementing the matrix multiplier that can easily be conceived by a person skilled in the art within the technical scope disclosed herein shall be covered by the scope of the present application.
The embodiment of the present application further provides a data processing method applied to the matrix multiplier, which will be described below with reference to the flowchart shown in fig. 3.
Step S101: the K vector stream processors fetch elements in a pre-stored first matrix one by one in row order from the local data sharing unit in parallel.
Taking the matrix multiplication A64x64 * B64x64 = C64x64 as an example, the 64 VSPs acquire the elements of the pre-stored first matrix one by one in row order from the LDS in parallel (A11, A12, …, A164, A21, A22, …, A6464). Since each element of the first matrix corresponds to a unique address when stored in the LDS, each VSP can obtain the element corresponding to an address from the local data sharing unit according to that address. Because each element corresponds to an address, after an element is obtained according to the current address, the current address needs to be updated to the address corresponding to the next element; for example, after A11 is obtained according to the current Address1, the current address needs to be updated to Address2, and if the VSPs were used to actively update the address, the address would have to be updated once after each element of matrix A is acquired, which is very time-consuming. The logic change register is connected with each vector stream processor and is used for storing the read address of each element in the first matrix; after the K vector stream processors read the corresponding element of the first matrix from the local data sharing unit in parallel according to the current address of the logic change register, the logic change register automatically updates to the address corresponding to the next element. Correspondingly, the K vector stream processors acquire the elements of the pre-stored first matrix one by one in row order from the local data sharing unit in parallel; specifically, they do so according to the current address of the logic change register.
Wherein the first matrix needs to be stored in the LDS in advance, therefore, before step S101, the method further includes: storing elements in the first matrix to the local data sharing unit. In one embodiment, the matrix multiplier further comprises a controller connected to the local data sharing unit. At this time, the elements in the first matrix may be stored to the local data sharing unit in a row order by a controller.
Furthermore, as an embodiment, each vector stream processor may perform the subsequent processing, such as obtaining the pre-stored elements of the second matrix from its corresponding vector general purpose register in parallel, after receiving a multiplication instruction for multiplying the first matrix by the second matrix, for example after each of the K vector stream processors receives a multiplication instruction sent by the controller. The controller is connected to each of the vector stream processors and sends multiplication instructions to the K vector stream processors in parallel (simultaneously) to instruct them to multiply the first matrix by the second matrix. That is, before step S101, the method further includes: the controller sends multiplication instructions to the K vector stream processors in parallel to instruct the K vector stream processors to multiply the first matrix by the second matrix. Of course, the multiplication of the first matrix and the second matrix may also be triggered by other means, such as by a timer.
Step S102: the K vector stream processors fetch the pre-stored corresponding elements from the second matrix in parallel from the respective corresponding vector general purpose registers.
Taking the matrix multiplication A64x64 * B64x64 = C64x64 as an example, when the 64 VSPs have obtained element A11 from the LDS in parallel, the 64 VSPs then obtain the corresponding elements of the pre-stored second matrix from their respective vector general purpose registers in parallel, i.e. the first row of elements in Table 2: VSP1 obtains B11 from the directly connected VGPR1, VSP2 obtains B12 from the directly connected VGPR2, VSP3 obtains B13 from the directly connected VGPR3, ……, and VSP64 obtains B164 from the directly connected VGPR64.
The second matrix needs to be stored in the K VGPRs in advance; therefore, before step S102, the method further includes: correspondingly storing each column of the second matrix into the K vector general registers, where each vector general register stores one column of the second matrix and different vector general registers store different columns. In one embodiment, the matrix multiplier further comprises a controller coupled to each of the K vector general registers via a bus. In this case, the controller may be used to correspondingly store each column of the second matrix into the K vector general registers. The storage is as shown in Table 2 above.
Step S103: the K vector stream processors each multiply the acquired elements from the first matrix with corresponding elements from the second matrix.
For example, for VSP1, the element A11 obtained from the first matrix is multiplied by the corresponding element B11 from the second matrix, the element A12 from the first matrix is multiplied by the corresponding element B21 from the second matrix, ……, and the element A164 from the first matrix is multiplied by the corresponding element B641 from the second matrix.
Step S104: and the K vector stream processors accumulate multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix one by one in parallel in sequence to obtain all elements in the same row of a third matrix.
For example, for VSP1, the multiplication results of the elements in the same row of the first matrix and the corresponding elements of the second matrix are sequentially accumulated, i.e. the multiplication results of the elements in the first row of the first matrix and the corresponding elements of the second matrix are sequentially accumulated to obtain C11, i.e. VSP1: C11 = A11*B11 + A12*B21 + A13*B31 + A14*B41 + … + A164*B641;
Similarly, for VSP2, the multiplication results of the elements in row 1 of the first matrix and the corresponding elements of the second matrix are sequentially accumulated to obtain C12, i.e. VSP2: C12 = A11*B12 + A12*B22 + A13*B32 + A14*B42 + … + A164*B642. Since the K VSPs process in parallel, this yields all elements of the same row of the third matrix, e.g. all elements of the first row of the third matrix.
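A quick numeric check on a 2x2 illustrative case shows the two lane formulas above agree with an ordinary matrix product (the values here are made up for illustration): VSP1's accumulation forms C11 and VSP2's forms C12, together giving the first row of C.

```python
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# VSP1's accumulation: row 1 of A against column 1 of B (stored in VGPR1)
C11 = A[0][0] * B[0][0] + A[0][1] * B[1][0]
# VSP2's accumulation: row 1 of A against column 2 of B (stored in VGPR2)
C12 = A[0][0] * B[0][1] + A[0][1] * B[1][1]

assert [C11, C12] == [19, 22]   # first row of A times B
```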
In addition, in order to reduce the occupation of VSP memory, after the K vector stream processors, in parallel, sequentially accumulate one by one the multiplication results generated by the elements in the same row of the first matrix and the corresponding elements of the second matrix, that is, after all the elements in the same row of the third matrix are obtained, the method further includes: the K vector stream processors store the accumulated results in parallel, in row order, to areas of the LDS that do not overlap with the first matrix. For example, after VSP1 obtains C11, it stores C11 at Address1 in the LDS; after VSP2 obtains C12, it stores C12 at Address2; ……; after VSP64 obtains C164, it stores C164 at Address64, where Address1 to Address64 are all addresses of regions that do not overlap with the regions where the unread elements of the first matrix are located. Of course, the K vector stream processors may instead store the accumulated results in parallel in their corresponding VGPRs, in an area not overlapping with the second matrix.
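The write-back step can be sketched as follows. The layout is assumed for illustration (a flat list standing in for the LDS, with A occupying the first slots): the target offset is simply chosen past the region still holding unread elements of A, so storing a finished row of C cannot clobber inputs.

```python
def store_row_to_lds(lds, c_row, base_offset):
    """Write one finished row of C into the LDS at base_offset, which the
    caller chooses so the written range does not overlap unread A elements."""
    for k, value in enumerate(c_row):
        lds[base_offset + k] = value
    return lds

lds = [0] * 8             # toy LDS; the first 4 slots hold a 2x2 matrix A
lds[:4] = [1, 2, 3, 4]
store_row_to_lds(lds, [19, 22], base_offset=4)   # row of C lands after A
assert lds == [1, 2, 3, 4, 19, 22, 0, 0]
```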
The data processing method provided by the embodiment of the present application has the same implementation principle and technical effect as the matrix multiplier, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing device embodiments for the parts that are not mentioned in the method embodiments.
Embodiments of the present application also provide an integrated circuit device that includes a substrate and a matrix multiplier disposed on the substrate. The substrate may be a circuit substrate commonly used at present, such as a PCB. It should be noted that, since the local data sharing unit (LDS) can implement data sharing, two or more matrix multipliers can share one LDS. For example, when it is necessary to calculate both matrix A x matrix B and matrix A x matrix C, two matrix multipliers can share one LDS: the elements of matrix A are stored in the LDS in row order, and when performing the matrix calculations, the elements stored in the LDS are loaded one by one, in parallel, both to the K vector stream processors in the first matrix multiplier and to the K vector stream processors in the second matrix multiplier. Accordingly, the integrated circuit device may also omit the LDS from the matrix multiplier, i.e. the LDS is not integrated in the integrated circuit device but exists separately.
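The shared-LDS arrangement can be illustrated with a small model (hypothetical names, list-based memory): matrix A is stored once, and the same row-major copy feeds two multiplier units computing A x B and A x C.

```python
def shared_lds_multiply(A, Bs):
    """Compute A times B for every B in Bs from a single row-major copy of A,
    modeling several matrix multipliers sharing one LDS."""
    results = []
    for B in Bs:                          # one matrix multiplier per B
        K = len(B[0])
        C = []
        for row in A:                     # the same stored rows feed each unit
            acc = [0] * K
            for i, a in enumerate(row):
                for k in range(K):
                    acc[k] += a * B[i][k]
            C.append(acc)
        results.append(C)
    return results

A = [[1, 0], [0, 1]]                      # identity, so A times X equals X
B = [[1, 2], [3, 4]]
C = [[5, 6], [7, 8]]
AB, AC = shared_lds_multiply(A, [B, C])
assert AB == B and AC == C
```

The point of the sharing is that A is stored (and occupies LDS space) only once, however many products reuse it.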
The embodiment of the present application further provides a processor including at least the integrated circuit device. The processor may be a general-purpose processor, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a microprocessor; it may also be an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.