WO2022143799A1 - 用于矩阵乘操作的集成电路装置、计算设备、系统和方法 - Google Patents

用于矩阵乘操作的集成电路装置、计算设备、系统和方法 Download PDF

Info

Publication number
WO2022143799A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
block
sub
blocks
computing
Prior art date
Application number
PCT/CN2021/142653
Other languages
English (en)
French (fr)
Inventor
孙正
李明
俞烨昊
陈支泽
边毅
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司 filed Critical 中科寒武纪科技股份有限公司
Priority to US18/013,635 priority Critical patent/US20230376562A1/en
Publication of WO2022143799A1 publication Critical patent/WO2022143799A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/544 Methods or arrangements for performing computations using non-contact-making devices for evaluating functions by calculation
    • G06F 7/5443 Sum of products

Definitions

  • This disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to an integrated circuit device, board, computing device, computing system and method for matrix multiply operations.
  • the field of artificial intelligence usually involves a large amount of data processing and operations, including matrix multiplication operations on various types of data.
  • many computing tasks involve large-scale matrix multiplication operations, especially the multiplication operations of large matrices.
  • Such matrix multiplication operations include, for example, the matrix multiplication of the weight matrix and the input vector in the fully connected layer, and of the input vector and the convolution kernel in the convolutional layer. It is conceivable that the larger the data volume and data scale of the matrix multiplication operations involved, the higher the storage requirements for the computing platform (especially the system-on-chip).
  • a processor such as a central processing unit (“CPU”) or a graphics processing unit (“GPU”) is usually used to perform the operation.
  • Because the processor is limited by the capacity of its internal register resources, a huge amount of data operations may result in a large amount of data interaction between the processor and the external storage device.
  • Since the bandwidth of the input/output ("I/O") bus between the processor and external memory is limited, severe I/O bottlenecks are likely to occur, delaying data transfer and greatly reducing the efficiency of parallel operations.
  • Not only does the bandwidth limitation of the I/O bus become a bottleneck for system performance, but the large number of I/O memory accesses between the processor and the external storage device also increases computing and power consumption overhead.
  • In view of this, the present disclosure provides a hardware architecture and an operation scheme capable of efficiently performing matrix multiplication operations, thereby reducing the amount of data transmitted to and from external storage devices, alleviating the I/O bottleneck caused by bus bandwidth limitations, and improving the efficiency of matrix multiplication.
  • the present disclosure provides the aforementioned solutions in the following aspects.
  • In one aspect, the present disclosure provides an integrated circuit device for a matrix multiply operation, comprising: an interface unit configured to obtain matrix data for the matrix multiply operation from an external memory, wherein the matrix data include a first matrix and a second matrix, the first matrix and the second matrix are respectively divided into N² first matrix blocks and N² second matrix blocks, and the matrix multiplication of the first matrix and the second matrix comprises N² matrix multiplication tasks based on the N² first matrix blocks and N² second matrix blocks, where N is a positive integer greater than or equal to 2; and N² main computing units connected in sequence to form a data transfer loop, wherein each main computing unit is configured to perform a respective one of the N² matrix multiplication tasks and includes: a plurality of storage areas configured to store matrix blocks and intermediate results for performing the matrix multiplication task; and a control unit configured to perform matrix block exchanges with adjacent main computing units.
  • In some embodiments, each of the main computing units is configured to: obtain one first matrix block and one second matrix block associated with its matrix multiplication task through the interface unit, and store them in a first storage area and a second storage area, respectively; perform a matrix multiplication operation on the one first matrix block and the one second matrix block to obtain an intermediate result; through the control unit and using the first storage area and the second storage area, perform N-1 matrix block exchanges with the adjacent main computing units, performing a matrix multiplication operation on the first matrix block and the second matrix block obtained in each exchange so as to obtain N-1 further intermediate results; and perform a sum operation on the N intermediate results to complete the matrix multiplication task associated therewith.
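The fetch-multiply-exchange-sum procedure above can be sketched in Python as a minimal software simulation (not the hardware implementation; the function name `cannon_multiply` and the use of NumPy are our own assumptions):

```python
import numpy as np

def cannon_multiply(A, B, N):
    """Simulate N*N main computing units connected in a data transfer loop.

    Each unit (i, j) fetches one first matrix block and one second matrix
    block, multiplies them, then performs N-1 block exchanges with its
    loop neighbours (Cannon's algorithm) and sums the N intermediate
    results to complete its matrix multiplication task.
    """
    n = A.shape[0] // N
    # Block held by unit (i, j) with Cannon's initial skew applied:
    # row i of the A blocks rotated left by i, column j of the B blocks up by j.
    a = [[A[i*n:(i+1)*n, ((i+j) % N)*n:((i+j) % N + 1)*n] for j in range(N)]
         for i in range(N)]
    b = [[B[((i+j) % N)*n:((i+j) % N + 1)*n, j*n:(j+1)*n] for j in range(N)]
         for i in range(N)]
    # First intermediate result from the blocks fetched from external memory.
    C = [[a[i][j] @ b[i][j] for j in range(N)] for i in range(N)]
    for _ in range(N - 1):
        # Each unit hands its A block to one loop neighbour and its B block
        # to the other, then multiplies the newly received pair.
        a = [[a[i][(j + 1) % N] for j in range(N)] for i in range(N)]
        b = [[b[(i + 1) % N][j] for j in range(N)] for i in range(N)]
        for i in range(N):
            for j in range(N):
                C[i][j] = C[i][j] + a[i][j] @ b[i][j]
    return np.block(C)  # unit (i, j) holds result block C[i][j]
```

Each matrix block thus enters the chip exactly once over the interface unit; all subsequent movement happens on the on-chip loop.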
  • the present disclosure discloses a board including an integrated circuit device as previously described and described later in various embodiments.
  • the present disclosure discloses a computing device including a board as previously described and described later in various embodiments.
  • the present disclosure discloses a computing system comprising a computing device as previously described and described later in various embodiments.
  • In another aspect, the present disclosure discloses a method of performing a matrix multiply operation using an integrated circuit device as previously described and described later in various embodiments, comprising: retrieving from an external memory, using an interface unit of the integrated circuit device, matrix data for the matrix multiply operation, wherein the matrix data include a first matrix and a second matrix, the first matrix and the second matrix are divided into N² first matrix blocks and N² second matrix blocks, respectively, and the matrix multiply operation of the first and second matrices includes N² matrix multiply tasks based on the N² first matrix blocks and N² second matrix blocks, where N is a positive integer greater than or equal to 2; and using each of the main computing units to perform the following operations: obtain, through the interface unit, one first matrix block and one second matrix block associated with its matrix multiplication task, and store them in the first storage area and the second storage area, respectively; perform a matrix multiplication operation on the one first matrix block and the one second matrix block to obtain an intermediate result; through the control unit and using the first storage area and the second storage area, perform N-1 matrix block exchanges with the adjacent main computing units and perform matrix multiplication operations on the exchanged matrix blocks so as to obtain N-1 further intermediate results; and perform a sum operation on the N intermediate results to complete the matrix multiplication task associated therewith.
  • In yet another aspect, the present disclosure provides a computer program product comprising program instructions for performing a matrix multiply operation which, when executed by one or more processors, cause the method described above and in various embodiments below to be implemented.
  • By means of the above solution, the on-chip resources of the system-on-chip can be fully utilized, and data sharing and transmission can be realized between main computing units, thereby significantly reducing the I/O data interaction with external memory and enabling efficient parallel execution of data transfers and multiply operations.
  • the solution of the present disclosure simplifies the complexity of the matrix multiplication operation and supports the matrix multiplication operation for very large matrices.
  • The solution of the present disclosure also improves the execution efficiency of the matrix multiplication operation and reduces the operation performance bottleneck caused by the bandwidth limitation of on-chip and off-chip I/O, thereby improving the overall performance of the integrated circuit device, computing device, computing system or board.
  • FIG. 1 is a schematic architectural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure
  • FIG. 2 is a schematic structural diagram illustrating a single main computing unit according to an embodiment of the present disclosure
  • FIG. 3 is an architectural diagram illustrating a “2*2” main computing unit according to an embodiment of the present disclosure
  • Figures 4a and 4b are block diagrams illustrating a "2*2" main computing unit for a convolution matrix multiply operation according to an embodiment of the present disclosure
  • Figure 5a and Figure 5b are block diagrams showing the structure of a "2*2" calculation subunit for a convolution matrix multiplication operation according to an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram illustrating a pipeline operation performed by an integrated circuit device according to an embodiment of the present disclosure
  • FIG. 7 is a structural diagram illustrating a “3*3” main computing unit according to an embodiment of the present disclosure
  • FIG. 8 is a diagram illustrating a board for a matrix multiply operation according to an embodiment of the present disclosure
  • FIG. 9 is a diagram illustrating a computing system for a matrix multiply operation according to an embodiment of the present disclosure.
  • FIG. 10 is a flowchart illustrating a method for performing a matrix multiply operation according to an embodiment of the present disclosure
  • FIG. 11 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 12 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic architectural diagram illustrating an integrated circuit device 102 for a matrix multiply operation according to an embodiment of the present disclosure.
  • the figure also shows the external memory 104 that exchanges information with the integrated circuit device 102 .
  • the external memory may be a dynamic random access memory (Dynamic Random Access Memory, "DRAM"), and matrix data related to the matrix multiply operation of the present disclosure may be stored in the dynamic random access memory.
  • The matrix multiplication operation may involve a first matrix and a second matrix, and the first matrix and the second matrix may be divided into N² first matrix blocks and N² second matrix blocks, where N is a positive integer greater than or equal to 2.
  • For example, when N is 2, the first matrix and the second matrix can each be divided into 4 matrix blocks; for instance, a "4*4" first matrix or second matrix can be divided into 4 "2*2" first matrix blocks or second matrix blocks.
  • Similarly, when N is 3, the first matrix and the second matrix can each be divided into 9 matrix blocks; for instance, a "6*6" first matrix or second matrix can be divided into 9 "2*2" first matrix blocks or second matrix blocks.
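The block division described here is plain tiling of a square matrix into an N x N grid; a short NumPy sketch (illustrative only; `split_blocks` is a hypothetical helper name):

```python
import numpy as np

def split_blocks(M, N):
    """Divide a square matrix into an N x N grid of equal-sized blocks."""
    n = M.shape[0] // N
    return [[M[i*n:(i+1)*n, j*n:(j+1)*n] for j in range(N)] for i in range(N)]

A = np.arange(16).reshape(4, 4)   # a "4*4" first matrix
blocks = split_blocks(A, 2)       # N = 2: four "2*2" first matrix blocks
assert blocks[0][1].shape == (2, 2)
assert np.array_equal(np.block(blocks), A)   # the blocks tile the matrix
```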
  • The integrated circuit device 102 provided by the present disclosure may include an interface unit 106 and N² main computing units 108.
  • In some embodiments, a Direct Memory Access ("DMA") interface can be used as the aforementioned interface unit, so as to send the matrix data from the external memory to the plurality of main computing units 108, for example the five main computing units exemplarily shown in the figure and the one or more main computing units indicated by the black dots omitted in the middle.
  • The N² main computing units of the present disclosure can form an "N*N" computing array to perform matrix multiplication operations in parallel.
  • The N² main computing units of the present disclosure are connected in sequence to form a data transfer loop, so that part of the data of the row blocks and column blocks of the above-mentioned first matrix blocks or second matrix blocks can be transferred to other main computing units on this continuous loop, enabling each main computing unit to perform the corresponding one of the N² matrix multiplication tasks mentioned above.
  • the main computing unit of the present disclosure will be described in detail below with reference to FIG. 2 .
  • The main computing unit of the present disclosure may include M² computing subunits, i.e., a computing array of "M*M", where M is a positive integer greater than or equal to 2.
  • the main computing unit may include multiple storage areas, such as the shared storage area shown in the figure and the private storage area associated with each computing subunit.
  • the shared memory area may be a different memory area than the private memory area.
  • the private storage area may be a storage space specially allocated for temporary storage of the computing subunit in the shared storage area.
  • multiple memory areas in the main computing unit may be configured to store matrix blocks and intermediate results for performing matrix multiplication tasks.
  • the main computing unit of the present disclosure further includes a control unit configured to exchange matrix blocks with the adjacent main computing units.
  • With this arrangement, the solution of the present disclosure enables each of the plurality of main computing units in the integrated circuit device to obtain from the external memory part of the matrix block data for its respective matrix multiplication task, and to obtain the other part (or parts) of the matrix block data from one or more adjacently connected main computing units through data interaction, so that all the matrix block data required to complete the corresponding matrix multiplication task are obtained and the task is completed on this basis.
  • In operation, each main computing unit may be configured to obtain, through the interface unit, one first matrix block (which comes from the first matrix) and one second matrix block (which comes from the second matrix) related to its matrix multiplication task, and to store them in the first storage area and the second storage area, respectively.
  • the first storage area and the second storage area may be two independent storage spaces allocated from the shared storage area to serve as buffers for storing intermediate data.
  • the main computing unit of the present disclosure may perform a matrix multiplication operation on the first matrix block and the second matrix block, thereby obtaining an intermediate result.
  • The matrix multiplication operation of the first matrix block and the second matrix block here can be performed in parallel by the M² calculation subunits in the main calculation unit.
  • Next, the main computing unit may perform N-1 matrix block exchanges with the adjacent main computing units through the control unit and using the first storage area and the second storage area, and perform a matrix multiplication operation on the first matrix block and the second matrix block obtained in each exchange, resulting in N-1 further intermediate results.
  • In each exchange, one main computing unit can obtain another first matrix block and another second matrix block from its two adjacently connected main computing units, so that one more intermediate result is obtained.
  • the main computing unit of the present disclosure may sum these intermediate results, thereby completing a matrix multiplication task associated therewith.
  • The main computing unit of the present disclosure utilizes its M² computing subunits to perform specific matrix multiplication tasks.
  • Further, the matrix multiplication operation of the present disclosure can also cover the case in which the first matrix block and the second matrix block are themselves divided again.
  • Specifically, the first matrix block and the second matrix block may be further divided into M² first matrix sub-blocks and M² second matrix sub-blocks, respectively.
  • Accordingly, one matrix multiplication task of the aforementioned one main computing unit may include M² matrix multiplier tasks based on the M² first matrix sub-blocks and M² second matrix sub-blocks.
  • Each of the M² computational subunits may be configured to perform a corresponding one of the M² matrix multiplier tasks.
  • In doing so, each calculation subunit may be configured to perform M matrix multiplication operations, thereby obtaining M intermediate sub-results.
  • the computing subunit may obtain a first matrix subblock and a second matrix subblock associated with its matrix multiplier task from the shared storage area (eg, the first storage area and the second storage area), respectively.
  • the computing subunit may perform a matrix multiplication operation on a first matrix subblock and a corresponding second matrix subblock to obtain an intermediate subresult.
  • the matrix multiplier task associated with it is completed by performing a sum operation on the M intermediate sub-results.
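A single computing subunit's inner loop, M multiplications followed by a sum over the M intermediate sub-results, might look like this (a sketch; `subunit_task` is our own name, and the sub-block pairs stand in for the data read from the shared storage area):

```python
import numpy as np

def subunit_task(a_subblocks, b_subblocks):
    """One matrix multiplier task: multiply each of the M sub-block pairs
    read from the shared storage area, then sum the M intermediate
    sub-results."""
    partials = [a @ b for a, b in zip(a_subblocks, b_subblocks)]  # M multiplies
    return sum(partials)                                          # sum operation

# M = 2: the subunit's result is a0 @ b0 + a1 @ b1.
a0, a1 = np.eye(2), 2 * np.eye(2)
b0, b1 = np.ones((2, 2)), np.ones((2, 2))
out = subunit_task([a0, a1], [b0, b1])
assert np.array_equal(out, 3 * np.ones((2, 2)))
```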
  • the disclosed solution also achieves a high degree of parallel operation.
  • The N² main computing units may be configured to perform their respectively associated matrix multiplication tasks in parallel.
  • The M² computing sub-units may be configured to perform their respectively associated matrix multiplier tasks in parallel.
  • In some embodiments, the matrices may be divided according to the rules of the Cannon algorithm.
  • For example, the first matrix and the second matrix participating in the matrix multiplication operation of the present disclosure may be divided into N² first matrix blocks and N² second matrix blocks at the level of the main computing unit according to the Cannon algorithm rules.
  • Further, a first matrix block and a second matrix block can be further divided according to the rules of the Cannon algorithm, so as to obtain M² first matrix sub-blocks and M² second matrix sub-blocks.
  • As can be seen from the above, the present disclosure performs multiple rounds of block processing on the matrix multiplication operation between large (or very large) matrices and executes them through the corresponding main computing units and computing sub-units, thereby realizing parallel pipelined execution of the matrix multiplication operation. The solution of the present disclosure therefore achieves the significant advantages of simplifying the complexity of the matrix multiplication operation and accelerating it. Further, by acquiring the matrix data from the external memory only once and thereafter exchanging it between the main computing units through their control units, frequent data interaction with the external memory is avoided and the existing I/O interaction bottleneck is broken.
  • In addition, the number of main computing units and computing sub-units of the present disclosure can be flexibly set according to the computing scenario, and matrix multiplication operations of any scale can be realized in a cascaded manner, so that the architecture is arranged flexibly and supports various types of matrix multiplication operation scenarios.
  • FIG. 3 is an architectural diagram illustrating 2² (i.e., 4) main computing units according to an embodiment of the present disclosure.
  • the four main computing units (including main computing unit 0 to main computing unit 3 ) are interconnected through the control unit to form a “2*2” computing array.
  • the 4 main computing units may be configured to perform matrix multiplication operations between the 4 first matrix blocks and the 4 second matrix blocks, and each main computing unit may Perform one of the 4 matrix multiply tasks.
  • FIG. 3 also shows the M² computing subunits included in each main computing unit. By allocating one matrix multiplication task to M² computing subunits for execution, parallel pipeline operations can be realized, thereby accelerating the matrix multiplication operation and meeting the needs of various application scenarios.
  • the integrated circuit device of the present disclosure can be applied in the field of artificial intelligence, especially in machine learning including deep neural networks.
  • the integrated circuit device of the present disclosure may perform the convolution operations involved in neural networks, which involve a large number of matrix multiplication operations, on the received first and second matrices.
  • The following will exemplarily describe, in conjunction with FIG. 4a and FIG. 4b, the matrix multiply operation performed by the integrated circuit device of the present disclosure for the convolution operation according to the Cannon algorithm.
  • FIG. 4a shows a schematic diagram of the structure of an integrated circuit device according to an embodiment of the present disclosure, which includes 4 (ie, “2*2”) interconnected main computing units, ie, main computing unit 0 to main computing unit 3 .
  • the figure does not show a plurality of calculation sub-units included in the main calculation unit.
  • Fig. 4b schematically shows two input matrices to be operated and matrix blocks of their computation results.
  • the two matrices on which the matrix multiplication operation is to be performed are a first matrix including the gradient of the convolution result and a second matrix including the convolution input, respectively.
  • the result matrix obtained after the two perform the matrix multiplication operation is the convolution weight gradient.
  • The four main computing units (each being a main computing unit 108 in FIG. 1) are numbered in clockwise order as main computing unit 0, main computing unit 1, main computing unit 2 and main computing unit 3, and are connected in sequence to form a closed loop.
  • There is a two-way communication connection between the adjacent main computing units 0 and 1; for example, two-way communication can be performed between the main computing units through DMA.
  • Similar two-way communication connections exist between main computing units 1 and 2, 2 and 3, and 3 and 0, respectively, so that matrix blocks can be transferred between them under the control of the control unit.
  • each main computing unit can also communicate with an external memory (shown by a dotted box in the figure) via the interface unit, so as to obtain the matrix block data (in this example, the convolution result gradient and convolution input).
  • The convolution weight gradient, which is the result of the matrix multiplication in this example, can be used during back propagation of the neural network to update the weights of the convolution operation used in forward propagation.
  • The convolution weight gradient calculation is equivalent to a multiply-accumulate computation of the convolution result gradient serving as the first matrix in this example (when it is a four-dimensional matrix, its dimensions can be expressed as NiHiWiCi as shown in the figure) and the convolution input serving as the second matrix in this example (when it is a four-dimensional matrix, its dimensions can be expressed as NoHoWoCo as shown in the figure).
  • Here, N is the number of samples, H is the height of the matrix, W is the width of the matrix, and C is the number of channels.
  • the input matrix "convolution result gradient” can be expressed as Ci*NiHiWi
  • the input matrix "convolution input” can be expressed as NoHoWo*Co
  • the two are convolved in the direction of NiHiWi and NoHoWo Weight gradient calculation (such as multiply-add operation)
  • the final output matrix "convolution weight gradient” can be expressed as Kh*Kw*Ci*Co (where Kh represents the height of the output matrix, Kw represents the width of the output matrix, and Ci represents the The number of channels of the input matrix "convolution result gradient", Co represents the number of channels of the input matrix "convolution input”).
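The dimension bookkeeping above can be checked with a toy example. All sizes here are our own hypothetical choices, and for simplicity we assume a 1*1 kernel with stride 1, so Kh = Kw = 1 and NiHiWi equals NoHoWo:

```python
import numpy as np

Ci, Co, NHW = 3, 4, 10
grad_out = np.ones((Ci, NHW))   # "convolution result gradient": Ci * NiHiWi
conv_in  = np.ones((NHW, Co))   # "convolution input":           NoHoWo * Co

# Multiply-accumulate along the NiHiWi / NoHoWo direction yields the
# Kh*Kw*Ci*Co weight gradient (Kh = Kw = 1 here, so just Ci * Co).
weight_grad = grad_out @ conv_in
assert weight_grad.shape == (Ci, Co)
assert np.all(weight_grad == NHW)   # each entry accumulates NHW products
```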
  • In this example, the first matrix "convolution result gradient" and the second matrix "convolution input" stored in the external memory can each be divided into four matrix blocks.
  • The four matrix blocks into which the first matrix "Convolution Result Gradient" is divided are denoted A00, A01, A10 and A11, as shown in Fig. 4b.
  • The four matrix blocks into which the second matrix "Convolution Input" is divided are denoted B00, B01, B10 and B11.
  • the output matrix "convolution weight gradient” which is the result matrix can also be divided into four matrix blocks C00, C01, C10 and C11.
  • Based on this division, each main computing unit can execute a respective one of the following formulae (1) to (4), so as to obtain the corresponding convolution weight gradients C00, C01, C11 and C10 respectively:

    C00 = A00*B00 + A01*B10    (1)
    C01 = A01*B11 + A00*B01    (2)
    C11 = A10*B01 + A11*B11    (3)
    C10 = A11*B10 + A10*B00    (4)
  • the solution of the present disclosure can use the four main computing units 0, 1, 2 and 3 shown in FIG. 4a to perform the computing tasks corresponding to the above equations (1) to (4), respectively, to obtain C00 , C01, C11 and C10.
  • Before the operation, the positions of A10 and A11 of the input matrix "convolution result gradient" shown in Figure 4b can be exchanged according to the rules of the Cannon algorithm, and the positions of B01 and B11 of the input matrix "convolution input" can likewise be exchanged, as indicated by the arrows in Figure 4b.
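This Fig. 4b exchange is the N = 2 case of Cannon's initial alignment: row 1 of the A blocks rotates left by one (A10 and A11 swap) and column 1 of the B blocks rotates up by one (B01 and B11 swap). Sketched with block labels, and assuming the clockwise unit numbering of the text:

```python
# Block layout before the exchange (row-major grid positions).
a = [["A00", "A01"],
     ["A10", "A11"]]
b = [["B00", "B01"],
     ["B10", "B11"]]

# Cannon's initial skew for N = 2.
a[1][0], a[1][1] = a[1][1], a[1][0]   # exchange A10 and A11
b[0][1], b[1][1] = b[1][1], b[0][1]   # exchange B01 and B11

# The clockwise units 0..3 now hold the first-task pairs from the text:
assert (a[0][0], b[0][0]) == ("A00", "B00")   # main computing unit 0
assert (a[0][1], b[0][1]) == ("A01", "B11")   # main computing unit 1
assert (a[1][1], b[1][1]) == ("A10", "B01")   # main computing unit 2
assert (a[1][0], b[1][0]) == ("A11", "B10")   # main computing unit 3
```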
  • each main computing unit may receive its corresponding one first matrix block and one second matrix block from the external memory, and perform corresponding matrix multiplication calculations.
  • For example, the main computing unit 0 may receive, via the interface unit, a first matrix block "A00" of the first matrix "Convolution Result Gradient" and a second matrix block "B00" of the second matrix "Convolution Input" from the external memory, and execute the first matrix multiplier task (A00*B00) as part of its matrix multiplication task according to equation (1), where "*" denotes a matrix multiplication operation.
  • Similarly, the main computing unit 1 receives its corresponding first matrix block and second matrix block (A01 and B11) via the interface unit, and performs its first matrix multiplier task (A01*B11) according to equation (2).
  • The main computing units 2 and 3 respectively receive first and second matrix blocks (A10 and B01) and (A11 and B10) via the data interface, and perform their respective first matrix multiplier tasks (A10*B01) and (A11*B10) according to equations (3) and (4).
  • While completing the reception of matrix block data from the external memory and performing its matrix multiplier task, each main computing unit may also receive another first matrix block and another second matrix block from the interconnected main computing units.
  • Specifically, each main computing unit of the present disclosure can use the two-way communication connections to send the partial matrix block data it received from the external memory to its adjacent main computing units, to serve as the corresponding matrix block data for the other (or second) matrix multiplier task of those adjacent main computing units.
  • obtaining "C00” can be regarded as the matrix multiplication task of the main computing unit 0, and according to formula (1), it can be known that another first matrix multiplication task required to complete the second matrix multiplication task in the "C00" matrix multiplication task is completed.
  • the matrix block and the second matrix block are "A01" and "B10", respectively.
  • the main computing unit 1 adjacent to the main computing unit 0 can send the first matrix block "A01” previously received from the external memory to the main computing unit 0.
  • Likewise, the main computing unit 3 adjacent to the main computing unit 0 may send the second matrix block "B10" previously received from the external memory to the main computing unit 0.
  • the main computing unit 0 can complete its second matrix multiplication task by performing a matrix multiplication operation on the received matrix block data "A01" and "B10".
  • Similarly, the main computing units 1, 2 and 3 can also use the two-way communication connections to receive the matrix block data sent by their adjacent main computing units, namely a corresponding first matrix block and second matrix block: ("A00" and "B01"), ("A11" and "B11"), and ("A10" and "B00"), respectively, as shown in the figure.
  • Next, each main computing unit may perform its respective second matrix multiplier task according to equations (1) to (4) and, by summing the intermediate results of the first and second matrix multiplier tasks, obtain its associated matrix multiplication result, namely the convolution weight gradients C00, C01, C11 and C10 in this example, thereby completing its respective matrix multiplication task.
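The two-round schedule just described can be verified numerically with a small sketch (random data of our own choosing; block names follow Fig. 4b):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # "convolution result gradient"
B = rng.standard_normal((4, 4))   # "convolution input"
blk = lambda M, i, j: M[2*i:2*i+2, 2*j:2*j+2]
A00, A01, A10, A11 = blk(A,0,0), blk(A,0,1), blk(A,1,0), blk(A,1,1)
B00, B01, B10, B11 = blk(B,0,0), blk(B,0,1), blk(B,1,0), blk(B,1,1)

# Round 1: each unit multiplies the pair fetched from external memory.
# Round 2: each unit multiplies the pair received from its neighbours
# over the loop, then sums both intermediate results (equations (1)-(4)).
C00 = A00 @ B00 + A01 @ B10   # main computing unit 0
C01 = A01 @ B11 + A00 @ B01   # main computing unit 1
C11 = A10 @ B01 + A11 @ B11   # main computing unit 2
C10 = A11 @ B10 + A10 @ B00   # main computing unit 3

# The four result blocks tile the full "convolution weight gradient".
assert np.allclose(np.block([[C00, C01], [C10, C11]]), A @ B)
```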
  • It can be seen that each main computing unit of the present disclosure only needs to receive part of the matrix block data from the external memory, while the other part of the matrix block data is received over the high-speed communication bus between main computing units, which is thereby better utilized.
  • the solution of the present disclosure significantly reduces the data interaction between the main computing unit and the external memory, thereby significantly reducing the amount of data transmission between on-chip and off-chip I/O and overcoming the I/O bottleneck caused by bandwidth limitation.
  • the four main computing units shown in Figure 4a forming a closed loop are merely exemplary and not limiting. According to specific application scenarios, those skilled in the art can also pre-arrange other suitable numbers of main computing units to form processing arrays and data transfer loops, such as those shown in FIG. 7 (described in detail later).
  • the matrix multiplication operation of the present disclosure may be performed by a plurality of computing subunits within each main computing unit.
  • the first matrix block and the second matrix block of the present disclosure can be further divided into a plurality of first matrix sub-blocks and second matrix sub-blocks, and thus each matrix multiplication task (for example, the above equations (1), (2), (3) or (4)) may be divided into a plurality of matrix multiplier tasks corresponding to each of the plurality of calculation subunits.
  • each computing subunit may read a corresponding first matrix subblock and a second matrix subblock from the shared storage area to perform matrix operations.
• for better understanding, how each calculation subunit completes its corresponding matrix multiplier task according to the rules of the Cannon algorithm will be discussed below with reference to FIG. 5a and FIG. 5b.
  • FIG. 5a and FIG. 5b are block diagrams illustrating the structure of a “2*2” calculation subunit for a convolution matrix multiplication operation according to an embodiment of the present disclosure.
  • the main computing unit 0 performs its first matrix multiplication task “A00*B00” according to the Cannon algorithm in the preceding convolution weight gradient calculation.
• the main computing unit 0 includes a shared memory area and four computing subunits sequentially numbered 0, 1, 2 and 3 (each being a computing subunit as shown in FIG. 2).
  • each computing subunit may receive (or load) the respective matrix data of the first matrix subblock and the second matrix subblock from the shared memory area.
• each calculation subunit in FIG. 5a receives a respective first matrix sub-block and second matrix sub-block from the shared memory area, and performs the corresponding operation to obtain an intermediate sub-result. By repeating the preceding steps, each calculation subunit can obtain another intermediate sub-result. Finally, by summing the aforementioned two intermediate sub-results, the intermediate result for its matrix multiplier task is obtained.
• the first matrix block "convolution result gradient" A00 (for example, a four-dimensional matrix, denoted as Ci*NiHiWi) and the second matrix block "convolution input" B00 (for example, a four-dimensional matrix, denoted as NoHoWo*Co) are stored in the shared memory area as the two input data for performing the first matrix multiplication task "convolution weight gradient" (A00*B00) of the main computing unit 0 (for simplicity, only the Ci*Co orientation is shown in the figure).
  • A00 can be divided into four first matrix sub-blocks a00, a01, a10 and a11 according to the Cannon algorithm
• B00 can be divided into four second matrix sub-blocks b00, b01, b10 and b11, and the eight matrix sub-blocks are stored in the shared memory area.
• the result C00 of the output matrix (A00*B00) can also be divided into four sub-blocks c00, c01, c10 and c11. Based on this, according to the operation rules of matrix multiplication in the Cannon algorithm, c00, c01, c11 and c10 can be obtained by the following equations (5) to (8):

c00 = a00*b00 + a01*b10    (5)
c01 = a00*b01 + a01*b11    (6)
c11 = a10*b01 + a11*b11    (7)
c10 = a10*b00 + a11*b10    (8)
  • the four calculation subunits 0, 1, 2 and 3 shown in FIG. 5a can be made to perform the calculations in the above equations (5) to (8) respectively, that is, to respectively perform their respective matrix multiplier tasks to get the corresponding c00, c01, c11 and c10.
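Equations (5) to (8) can be checked directly with a small NumPy example; the 4*4 sizes and the split helpers are illustrative assumptions, not taken from the disclosure.

```python
import numpy as np

# A00 and B00 stand for one main unit's first and second matrix blocks;
# sizes are illustrative. Each is split into 2x2 sub-blocks, one pair of
# operand sub-blocks per computing subunit, per equations (5)-(8).
A00 = np.random.default_rng(0).standard_normal((4, 4))
B00 = np.random.default_rng(1).standard_normal((4, 4))
(a00, a01), (a10, a11) = [np.hsplit(h, 2) for h in np.vsplit(A00, 2)]
(b00, b01), (b10, b11) = [np.hsplit(h, 2) for h in np.vsplit(B00, 2)]

c00 = a00 @ b00 + a01 @ b10   # subunit 0, equation (5)
c01 = a00 @ b01 + a01 @ b11   # subunit 1, equation (6)
c11 = a10 @ b01 + a11 @ b11   # subunit 2, equation (7)
c10 = a10 @ b00 + a11 @ b10   # subunit 3, equation (8)

# Reassembling the four sub-blocks reproduces the full block product.
C00 = np.block([[c00, c01], [c10, c11]])
assert np.allclose(C00, A00 @ B00)
```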
  • the matrix sub-blocks of the calculation subunit 0 that perform the matrix multiplier task include a00, b00, a01 and b10.
  • the matrix sub-blocks of the calculation sub-unit 2 that executes this subtask are a10, b01, a11 and b11.
• the positions of a10 and a11 of the "convolution result gradient" A00 shown on the left side of FIG. 5b can be exchanged, and the positions of b01 and b11 of the "convolution input" B00 can be exchanged.
• the first and second matrix sub-blocks with which the calculation subunit 1 performs the matrix multiplier task of obtaining c01 are a00, b01, a01 and b11, and the first and second matrix sub-blocks with which the calculation subunit 3 performs the matrix multiplier task of obtaining c10 are a10, b00, a11 and b10.
  • each of the four computing subunits can receive respective first and second matrix subblocks from the shared memory area.
  • compute subunit 0 can load (a00 and b00) from shared memory to perform a matrix multiply computation of (a00*b00).
  • computation subunit 0 can then load the (a01 and b10) portion from the shared memory area to perform a matrix multiplication computation of (a01*b10).
• the calculation subunit 0 thereby completes its associated matrix multiplier task. The computation subunits 1, 2 and 3 perform operations similar to those of computation subunit 0, thereby completing their respective matrix multiplier tasks.
• the result obtained by each matrix multiplier task in the first matrix multiplication task (e.g., A00*B00) of the main computing unit 0 is only an intermediate sub-result. Therefore, the multiple matrix multiplier tasks corresponding to the second matrix multiplication task (e.g., A01*B10) must be further completed to obtain another intermediate result, and the final result can then be obtained by summing the two intermediate results, as shown in FIG. 5b.
• the calculation subunit 0 may perform, for example, the matrix multiplier task corresponding to the first matrix multiplication task (A00*B00) according to equation (5), and use the obtained c00 as the first sub-result c00_1.
• the calculation subunit 0 then executes the corresponding matrix multiplier task in the second matrix multiplication task (A01*B10) to obtain the second sub-result c00_2.
• finally, the two sub-results c00_1 and c00_2 can be summed to obtain the matrix sub-block c00 in the output matrix block C00.
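This two-stage accumulation can be sketched as follows, assuming illustrative 4*4 blocks and a hypothetical `sub` helper for indexing sub-blocks (neither is part of the disclosure):

```python
import numpy as np

rng = np.random.default_rng(42)
A00, A01 = rng.standard_normal((2, 4, 4))
B00, B10 = rng.standard_normal((2, 4, 4))

def sub(block, i, j):
    """Return the 2x2 sub-block at row i, column j of a 4x4 block."""
    return block[2*i:2*i+2, 2*j:2*j+2]

# First matrix multiplication task (A00*B00): subunit 0's first sub-result.
c00_1 = sub(A00, 0, 0) @ sub(B00, 0, 0) + sub(A00, 0, 1) @ sub(B00, 1, 0)
# Second matrix multiplication task (A01*B10): the corresponding second sub-result.
c00_2 = sub(A01, 0, 0) @ sub(B10, 0, 0) + sub(A01, 0, 1) @ sub(B10, 1, 0)
# Summing the two sub-results yields sub-block c00 of C00 = A00*B00 + A01*B10.
c00 = c00_1 + c00_2
assert np.allclose(c00, (A00 @ B00 + A01 @ B10)[:2, :2])
```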
• since equation (5) includes a two-part addition, c00_2 is likewise obtained by adding two intermediate results; c00_1 can therefore also be sequentially accumulated with the first and second intermediate results of c00_2, thereby obtaining the matrix sub-block c00. The specific operation will be described later with reference to the calculation operation column of the sixth and seventh time slices in FIG. 6.
• the calculation subunits 1, 2 and 3 can similarly obtain the matrix sub-blocks c01, c11 and c10 in C00, respectively, so that the four matrix sub-blocks c00, c01, c11 and c10 shown on the right side of FIG. 5b constitute the output matrix block C00 obtained by the main computing unit 0 performing its matrix multiplication task. Since the intermediate calculation results (e.g., c00, c01, c11 and c10) of each calculation subunit can be stored in the shared storage area of the corresponding main computing unit without being stored in an external memory, the solution of the present disclosure can reduce data exchange with the external memory, thereby alleviating the I/O bottleneck caused by the external bandwidth limitation.
• that the main computing unit shown in FIG. 5a includes four computing subunits is only exemplary and not limiting. According to different application scenarios, those skilled in the art can preset different numbers of calculation subunits based on the teachings of the present disclosure, or enable or disable different numbers of calculation subunits, so as to perform matrix multiplication calculations such as those of the Cannon algorithm.
  • FIG. 6 is a schematic diagram illustrating pipeline operations performed by an integrated circuit device (including a main computing unit and its computing sub-units) according to an embodiment of the present disclosure.
• FIG. 6 takes the main computing unit 0 and its computing subunit 0 shown in FIG. 5a and FIG. 5b performing a convolution operation as an example, and shows in chronological order the data transfers and specific operations (including, for example, data loads and matrix multiplication operations) among the main computing unit 0, the computing subunit 0, the external memory and the shared memory area.
• FIG. 6 shows, row by row from the first time slice to the end of the eighth time slice, the pipeline in which the main computing unit 0 and its computing subunit 0 perform the corresponding data reception, transmission, loading or matrix multiplication operations to finally obtain the matrix sub-block c00 in the output matrix block C00 of the convolution weight gradient.
  • column 1 represents the operation of loading data from external memory (eg, via DDR), from which the first and second matrix blocks discussed in this disclosure are received, for example;
• column 2 represents data transfer between main computing units: for example, the shared memory area of main computing unit 0 sends its first and second matrix blocks to the adjacent main computing units 1 and 3, and receives their first and second matrix blocks from main computing units 1 and 3 as the operation data for the second matrix multiplication task performed by main computing unit 0;
  • the third column represents the data loading of the computing subunit 0;
  • the fourth column represents the matrix multiplying operation performed in the computing subunit 0.
  • the main computing unit 0 performs the corresponding operation in the corresponding time slice.
  • the shared memory area of the main computing unit 0 only performs the operation of storing the B00 received from the external memory (ie, "off-chip").
  • the shared storage area of the main computing unit executes the operation of receiving A00 from the external memory and the computing subunit 0 executes the operation of loading b00 in B00 from the shared storage area.
  • the on-chip operations of the present disclosure may be ping-pong pipelining operations.
• the on-chip storage resources can be divided into two parts: a ping part and a pong part.
  • the main computing unit of the present disclosure can perform parallel ping-pong pipelining operations.
  • the main computing unit 0 loads B00 from the external memory and stores it in the ping part of the shared memory area.
  • the main computing unit 0 loads A00 from the external memory and stores it in the ping part of the shared memory area.
  • the b00 of B00 can be loaded to the calculation subunit 0 in parallel.
  • a00 of A00 can be loaded into calculation subunit 0.
• the main computing unit 0 sends A00 to the interconnected main computing unit 1 and B00 to the interconnected main computing unit 3 through the control unit, and receives A01 from the main computing unit 1 and B10 from the main computing unit 3 via the control unit.
• b10 of B00 and a01 of A00 can be loaded into the calculation subunit 0; at the same time, in the matrix multiplication operation column of the fourth time slice, a00*b00 of A00 and B00 can be calculated to obtain an intermediate sub-result.
• a01*b10 of A00 and B00 is then calculated to obtain another intermediate sub-result, which is accumulated with the intermediate sub-result of the previous time slice to obtain the intermediate result at the fifth time slice.
• b10 of B10 and a01 of A01 can be loaded into the calculation subunit 0; at the same time, in the matrix multiplication operation column of the sixth time slice, a00*b00 of A01 and B10 is calculated to obtain an intermediate sub-result, which is accumulated with the intermediate sub-result of the previous time slice to obtain the intermediate result of the sixth time slice.
• the pong part of the aforementioned on-chip storage resource is used to receive the next set of B00 (B00') and A00 (A00') from the external memory for the main computing unit 0 to perform its next first matrix multiplication task.
• the calculation subunit 0 stores the c00 of C00 output by the calculation of the previous time slice into the shared storage area. At the same time, b00 of the next group B00' and a00 of A00' are loaded into the calculation subunit 0 for calculation in the next time slice (not shown).
  • the calculation subunits 1, 2 and 3 of the main calculation unit 0, and the different main calculation units and their calculation subunits also perform similar operations for the above 8 time slices to obtain the corresponding matrix blocks of the respective output matrices. Since the input matrix "convolution result gradient" and "convolution input” can be multi-dimensional structures, the calculation results in the three directions of the NHW can be calculated first and accumulated. Then, the above calculation is performed in a loop on the Ci and Co dimensions of the two input matrices to obtain the calculation result of the output matrix "convolution weight gradient".
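The ping-pong scheme above can be sketched as a double-buffered loop. This is a software simulation only: the slot handling and the `pingpong_matmul` name are assumptions of the sketch, and real hardware would overlap the load into the idle slot with the multiply from the busy slot in time.

```python
import numpy as np

def pingpong_matmul(pairs):
    """Double-buffered multiply over a stream of (A, B) operand pairs.

    The two buffer slots stand for the ping and pong halves of the
    on-chip shared memory: while slot `cur` is being consumed by the
    multiply, the next pair is loaded into the other slot, mimicking
    the overlap of time slices in FIG. 6."""
    buf = [None, None]
    results = []
    cur = 0
    buf[cur] = pairs[0]                       # initial load into 'ping'
    for t in range(len(pairs)):
        nxt = cur ^ 1
        if t + 1 < len(pairs):
            buf[nxt] = pairs[t + 1]           # load next pair into the idle slot
        A, B = buf[cur]
        results.append(A @ B)                 # compute from the current slot
        cur = nxt                             # swap ping and pong
    return results

rng = np.random.default_rng(7)
stream = [(rng.standard_normal((3, 3)), rng.standard_normal((3, 3))) for _ in range(4)]
out = pingpong_matmul(stream)
assert all(np.allclose(out[k], stream[k][0] @ stream[k][1]) for k in range(4))
```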
  • FIG. 7 is a structural diagram illustrating a “3*3” main computing unit according to an embodiment of the present disclosure.
• the "3*3" main computing units can perform the matrix multiplication operation shown in the upper part of FIG. 7 by forming a computing array and a data transfer loop.
• the "3*3" arrangement of main computing units needs to perform two data transfers between adjacent main computing units, instead of the one data transfer of the "2*2" arrangement. In other words, in the scheme of the present disclosure, "N*N" main computing units require (N-1) data transfers or exchanges between adjacent main computing units. For ease of understanding, the lower part of FIG. 7 shows the first matrix block and second matrix block data obtained by each main computing unit after the first and second rounds of data transfer.
• after obtaining its first matrix block "A23" and second matrix block "B32" from the external memory, the main computing unit 5, in the first round of data transfer, receives another first matrix block "A21" from the main computing unit 6 and a second matrix block "B12" from the main computing unit 8 to perform its corresponding matrix multiplication task "A21*B12". In the second round of data transfer, it receives another first matrix block "A22" from the main computing unit 6 and a second matrix block "B22" from the main computing unit 8 to perform its corresponding matrix multiplication task "A22*B22".
• the "3*3" main computing units can support dividing each of two large matrices into "3*3" matrix blocks to execute the matrix multiplication operation.
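The general rule that "N*N" main computing units need N-1 exchange rounds can be sketched with block arrays. The skewed initial load and the roll-based exchanges below are a standard rendering of Cannon's algorithm; all names are chosen for illustration.

```python
import numpy as np

def cannon_nxn(A_blocks, B_blocks):
    """Cannon's algorithm on an n x n grid of 'main computing units'.

    A_blocks/B_blocks have shape (n, n, s, s). After the initial skewed
    load, each unit multiplies its local pair; then n-1 exchange rounds
    rotate A-blocks left along rows and B-blocks up along columns, with
    one multiply-accumulate per round."""
    n = A_blocks.shape[0]
    # Initial alignment (skew): unit (i, j) holds A[i, i+j] and B[i+j, j].
    A_loc = np.array([[A_blocks[i, (i + j) % n] for j in range(n)] for i in range(n)])
    B_loc = np.array([[B_blocks[(i + j) % n, j] for j in range(n)] for i in range(n)])
    C = np.einsum('ijxy,ijyz->ijxz', A_loc, B_loc)     # local multiply
    for _ in range(n - 1):                             # n-1 exchange rounds
        A_loc = np.roll(A_loc, -1, axis=1)             # shift A left along rows
        B_loc = np.roll(B_loc, -1, axis=0)             # shift B up along columns
        C += np.einsum('ijxy,ijyz->ijxz', A_loc, B_loc)
    return C

# "3*3" example: 6x6 matrices as 3x3 grids of 2x2 blocks.
n, s = 3, 2
rng = np.random.default_rng(5)
A = rng.standard_normal((n * s, n * s))
B = rng.standard_normal((n * s, n * s))
Ab = A.reshape(n, s, n, s).swapaxes(1, 2)   # (n, n, s, s) block view
Bb = B.reshape(n, s, n, s).swapaxes(1, 2)
C = cannon_nxn(Ab, Bb).swapaxes(1, 2).reshape(n * s, n * s)
assert np.allclose(C, A @ B)
```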
  • FIG. 8 is a diagram illustrating a board 800 for matrix multiply operations according to an embodiment of the present disclosure.
  • the board includes four integrated circuit devices as previously described in connection with FIGS. 1-7 .
• the disclosed scheme can perform matrix multiplication operations on the first matrix and the second matrix divided into "P²*N²*M²" matrix blocks, respectively, where P is a positive integer greater than or equal to two.
  • FIG. 9 is a diagram illustrating a computing system 900 for matrix multiply operations according to an embodiment of the present disclosure.
  • the computing system 900 includes four servers or mainframes, wherein one or more of the boards shown in FIG. 8 are arranged in each mainframe to support matrix multiply operations of very large-scale matrices.
• when two very large matrices are to be multiplied, they can each be divided into four matrix blocks according to the computing system of FIG. 9.
• each matrix block is further divided according to the number of boards on each host, and so on, until the very large matrices participating in the matrix multiplication are divided down to the matrix multiplication operation granularity supported by the computing subunits of the present disclosure.
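The hierarchical division can be sketched as a recursive blocked multiplication, applying one grid factor per level; the function and the choice P = N = M = 2 are illustrative assumptions of the sketch.

```python
import numpy as np

def blocked_matmul(A, B, factors):
    """Recursively apply the block decomposition level by level:
    first into P x P blocks (system/board level), then each block task
    into N x N (main computing units), then M x M (computing subunits),
    bottoming out in a plain matrix multiply. Illustrative only."""
    if not factors:
        return A @ B
    f, rest = factors[0], factors[1:]
    s = A.shape[0] // f                       # tile edge at this level
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(f):
        for j in range(f):
            for k in range(f):                # sum over the inner block index
                C[s*i:s*(i+1), s*j:s*(j+1)] += blocked_matmul(
                    A[s*i:s*(i+1), s*k:s*(k+1)],
                    B[s*k:s*(k+1), s*j:s*(j+1)], rest)
    return C

A = np.random.default_rng(8).standard_normal((8, 8))
B = np.random.default_rng(9).standard_normal((8, 8))
# Three levels with factor 2 each, i.e. P^2 * N^2 * M^2 bottom-level tasks.
assert np.allclose(blocked_matmul(A, B, (2, 2, 2)), A @ B)
```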
  • FIG. 10 is a flowchart illustrating a method 1000 for performing a matrix multiply operation according to an embodiment of the present disclosure.
  • the method 1000 can be performed by the integrated circuit device of the present disclosure, so the description about the integrated circuit device is also applicable to the description of the method 1000 below.
  • method 1000 retrieves matrix data for the matrix multiply operation from external memory using an interface unit of an integrated circuit device.
• the matrix data here includes a first matrix and a second matrix, wherein the first matrix and the second matrix are divided into N² first matrix blocks and N² second matrix blocks, respectively, and the matrix multiplication operation of the first matrix and the second matrix includes N² matrix multiplication tasks based on the N² first matrix blocks and N² second matrix blocks, where N is a positive integer greater than or equal to 2.
  • the method 1000 executes steps 1004-1010 to complete the matrix multiplication task of the main computing unit.
  • the method 1000 obtains a first matrix block and a second matrix block associated with its matrix multiplication task through the interface unit, and stores them in the first storage area and the second storage area, respectively.
• the method 1000 performs a matrix multiplication operation on the first matrix block and the second matrix block to obtain an intermediate result.
• the method 1000 performs N-1 matrix block exchanges with the adjacent main computing units through the control unit and using the first storage area and the second storage area, and performs a matrix multiplication operation on the first matrix block and second matrix block obtained in each exchange to obtain N-1 intermediate results, respectively.
  • method 1000 performs a sum operation on the N intermediate results to complete the matrix multiplication task associated therewith.
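The steps above can be sketched from one main computing unit's viewpoint; `fetch` and `exchange` are hypothetical callables standing in for the interface unit and the control unit, and the grid size is illustrative.

```python
import numpy as np

def main_unit_task(i, j, fetch, exchange, n):
    """One main unit's matrix multiplication task in method 1000.

    fetch(i, j)          -> the unit's own (first, second) matrix blocks
    exchange(i, j, step) -> the pair obtained in exchange round `step`
    Both callables are assumptions of this sketch."""
    A_blk, B_blk = fetch(i, j)                # obtain first and second blocks
    result = A_blk @ B_blk                    # first intermediate result
    for step in range(1, n):                  # n-1 exchanges, one multiply each
        A_blk, B_blk = exchange(i, j, step)
        result = result + A_blk @ B_blk       # summing the n intermediate results
    return result

# Simulate the external memory and the exchanges from a global matrix pair.
n, s = 2, 3
rng = np.random.default_rng(11)
A = rng.standard_normal((n * s, n * s))
B = rng.standard_normal((n * s, n * s))
blk = lambda M, i, j: M[s*i:s*(i+1), s*j:s*(j+1)]
fetch = lambda i, j: (blk(A, i, (i + j) % n), blk(B, (i + j) % n, j))
exchange = lambda i, j, step: (blk(A, i, (i + j + step) % n),
                               blk(B, (i + j + step) % n, j))
C = np.block([[main_unit_task(i, j, fetch, exchange, n) for j in range(n)]
              for i in range(n)])
assert np.allclose(C, A @ B)
```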
• the method of the present disclosure has been described above only briefly in connection with FIG. 10. Those skilled in the art will appreciate, based on the present disclosure, that the method 1000 may include more steps whose execution realizes the various operations of the present disclosure described above in conjunction with FIG. 1 to FIG. 9, which are not repeated here.
  • FIG. 11 is a structural diagram illustrating a combined processing apparatus 1100 according to an embodiment of the present disclosure.
  • the combined processing device 1100 includes a computing processing device 1102 , an interface device 1104 , other processing devices 1106 and a storage device 1108 .
  • one or more integrated circuit devices 1110 may be included in the computing processing device, and the integrated circuit devices may be configured to perform the matrix multiplication operations described herein in conjunction with FIGS. 1-10 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
• when multiple computing devices are implemented as artificial intelligence processor cores or as parts of the hardware structure of artificial intelligence processor cores, the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
• these processors may include, but are not limited to, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
• the other processing devices may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a computing device related to artificial intelligence such as neural network operations) and external data and control, performing basic controls including but not limited to data movement, and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in the internal or on-chip storage of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 1202 shown in FIG. 12).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 11 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 1206 shown in FIG. 12 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
• the chip may also integrate other processing units (such as video codecs) and/or interface modules (such as DRAM interfaces).
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 12 .
  • FIG. 12 is a schematic structural diagram illustrating a board 1200 according to an embodiment of the present disclosure, wherein the board shown in FIG. 8 can be regarded as a specific form of the board 1200 .
  • the board includes a storage device 1204 for storing data, which includes one or more storage units 1210 .
  • the storage device can be connected to the control device 1208 and the chip 1202 described above for connection and data transmission through, for example, a bus.
  • the board also includes an external interface device 1206, which is configured for data relay or transfer function between the chip (or the chip in the chip package structure) and the external device 1212 (such as a server or a computer, etc.).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • an electronic device or device which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or a plurality of the above-mentioned combined processing devices.
• the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras).
• the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions . Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure, and can also refer to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
• the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions to cause a computer device (e.g., a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
• the aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, an optical disc, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
• the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.
  • An integrated circuit device for a matrix multiply operation comprising:
• an interface unit configured to obtain matrix data for the matrix multiply operation from an external memory, wherein the matrix data includes a first matrix and a second matrix, wherein the first matrix and the second matrix are respectively divided into N² first matrix blocks and N² second matrix blocks, and the matrix multiplication operation of the first matrix and the second matrix includes N² matrix multiplication tasks based on the N² first matrix blocks and N² second matrix blocks, where N is a positive integer greater than or equal to 2;
• each main computing unit is configured to perform a corresponding one of the N² matrix multiplication tasks, and includes:
  • a plurality of memory areas configured to store matrix blocks and intermediate results for performing matrix multiply tasks
• a control unit configured to perform matrix block exchange with an adjacent main computing unit;
  • each of the main computing units is configured to:
  • N-1 times of matrix block exchange is performed with the adjacent main computing unit, and the first matrix block and the second matrix exchanged each time are exchanged.
  • block performs matrix multiplication operations to obtain N-1 intermediate results, respectively;
  • each of the main computing units includes M 2 computing subunits, and the first matrix block and the second matrix block are respectively divided into M 2 a first matrix sub-block and M 2 second matrix sub-blocks, and one of said matrix multiplication tasks includes M 2 matrix multiplier tasks based on M 2 first matrix sub-blocks and M 2 second matrix sub-blocks, wherein each of the M 2 computing subunits is configured to perform a corresponding one of the M 2 matrix multiplier tasks, and in performing the corresponding one matrix multiplier task, the computing subunits are configured to:
  • a sum operation is performed on the M intermediate sub-results to complete the matrix multiplier task associated therewith.
  • Clause A3 The integrated circuit device of Clause A2, wherein the first memory area and the second memory area are shared memory areas shared by the N2 computing subunits.
  • Clause A4 The integrated circuit device of Clause A2, wherein the plurality of storage areas of each of the main computing units further includes M2 private sub-storage areas, and each private sub-storage area is associated with a corresponding one of the computing sub-units associated, and configured to store intermediate sub-results.
  • Clause A5 The integrated circuit device of Clause A2, wherein the N 2 main computational units are configured to perform in parallel the matrix multiply tasks associated with each, and the M 2 computational subunits are configured to perform in parallel with The respective associated matrix multiplier tasks.
  • Clause A6 The integrated circuit device of any one of clauses A1-A5, wherein the first and second matrices are divided according to Cannon algorithm rules to obtain N 2 first matrix blocks and N 2 th Two matrix blocks.
  • Clause A7 The integrated circuit device of any one of clauses A2-A5, wherein the first matrix block and the second matrix block are divided according to Cannon algorithm rules to obtain M 2 first matrix sub-blocks and M 2 second matrix sub-blocks.
  • Item A9 The board according to Item A8, wherein when the board includes P 2 of the integrated circuit devices, the integrated circuit devices are connected in sequence to form a data transfer loop so that the pairs are divided into P 2 respectively *N 2 *M
  • the first matrix and the second matrix of the 2 matrix blocks perform a matrix multiplication operation, and P is a positive integer greater than or equal to 2.
  • a computing system comprising a plurality of computing devices according to Clause A10, wherein the plurality of computing devices are interconnected and operate cooperatively to implement a distributed matrix multiply operation.
  • Clause A12 A method of performing a matrix multiply operation using the integrated circuit device of any one of clauses A1-A7, comprising:
  • the matrix data includes a first matrix and a second matrix, wherein the first matrix and the second matrix are respectively divided into N 2 first matrix blocks and N 2 second matrix blocks, and the matrix multiplication operation of the first matrix and the second matrix includes N 2 matrices based on the N 2 first matrix blocks and the N 2 second matrix blocks multiply tasks, where N is a positive integer greater than or equal to 2;
  • control unit and the first storage area and the second storage area are used to perform N-1 times of matrix block exchange with the adjacent main computing unit, and the first matrix block and the second matrix exchanged each time are exchanged.
  • block performs matrix multiplication operations to obtain N-1 intermediate results, respectively;
  • a sum operation is performed on the M intermediate sub-results to complete the matrix multiplier task associated therewith.
  • Clause A14 The method of Clause A13, wherein the first storage area and the second storage area are shared storage areas shared by the N2 computing subunits.
  • Clause A15 The method according to Clause A13, wherein the plurality of storage areas of each of the main computing units further includes M2 private sub-storage areas, and each private sub-storage area is associated with a corresponding one of the computing sub-units, and is configured to store intermediate sub-results.
  • Clause A16 The method of Clause A13, wherein the N 2 main computing units are used to perform matrix multiply tasks associated with each in parallel, and the M 2 computing subunits are used to perform parallel execution associated with each Matrix multiplier task.
  • Clause A17 The method of any of clauses A12-A16, comprising dividing the first and second matrices using Cannon algorithm rules to obtain N 2 first matrix blocks and N 2 second matrices matrix block.
  • Clause A18 The method of any one of clauses A13-A16, wherein the first matrix block and the second matrix block are divided according to Cannon algorithm rules to obtain M 2 first matrix sub-blocks and M 2 The second matrix sub-block.
  • Clause A19 A computer program product comprising program instructions for performing a matrix multiplication operation which, when executed by one or more processors, cause implementation of the invention according to any of clauses A12-A18 Methods.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined” or “in response to the determination” or “once the [described condition or event] is detected. ]” or “in response to detection of the [described condition or event]”.


Abstract

一种集成电路装置、电子设备、板卡和使用集成电路装置来执行矩阵乘的方法。该集成电路装置可以包括在组合处理装置的计算处理装置中,该计算处理装置可以包括一个或多个集成电路装置,组合处理装置还可以包括接口装置和其他处理装置,计算处理装置与其他处理装置进行交互,共同完成用户指定的计算操作,组合处理装置还可以包括存储装置,该存储装置分别与设备和其他处理装置连接,用于存储该设备和其他处理装置的数据。本方案可以降低内部设备与外部存储装置之间的数据传输量,最大程度地减少了由于带宽限制而导致的I/O瓶颈问题,可以提高集成电路装置的整体性能。

Description

用于矩阵乘操作的集成电路装置、计算设备、系统和方法
相关申请的交叉引用
本申请要求于2020年12月30日申请的,申请号为202011610669.4,名称为“用于矩阵乘操作的集成电路装置、计算设备、系统和方法”的中国专利申请的优先权。
技术领域
本披露一般地涉及数据处理领域。更具体地,本披露涉及一种用于矩阵乘操作的集成电路装置、板卡、计算设备、计算系统和方法。
背景技术
人工智能领域通常涉及大量的数据处理和运算,这其中包括各种类型数据的矩阵乘操作。以当前人工智能领域内的机器学习为例,其中的许多计算任务都涉及到大规模的矩阵乘运算,特别是大矩阵的相乘操作。进一步以机器学习中的深度学习为例,其中就包含着类型和数目众多的矩阵乘操作,包括例如全连接层中的权值矩阵和输入向量的矩阵乘操作以及卷积层中的输入向量和卷积核的矩阵乘操作。可以想到的是当涉及的矩阵乘运算数据量和数据尺度越大,则对计算平台(特别是对片上系统)的存储量的要求就越高。
在现有的矩阵乘运算中,通常会利用中央处理器(“CPU”)或者图像处理单元(“GPU”)等处理器进行运算。然而,由于处理器受制于内部寄存器资源的容量限制,庞大的数据运算量可能会导致处理器与外部存储设备之间产生大量的数据交互。由于处理器与外部存储器之间的输入/输出(“I/O”)总线的带宽是有限的,由此就很可能会出现严重的I/O瓶颈问题,由此造成数据传递的延迟并且极大地降低了并行运算时的运算效率。进一步,不仅I/O总线的带宽限制会成为系统性能的瓶颈,而且处理器与外部存储设备间大量的I/O访存量也会对计算和功耗开销带来不利的影响。
发明内容
为了至少解决上文中所提到的技术问题,本披露提供了一种能够高效地执行矩阵乘操作的硬件架构和运算方式,以减少与外部存储设备的数据传输量,最大程度地缓解总线带宽限制带来的I/O瓶颈问题,从而提高矩阵乘的运算效率。具体地,本披露在如下的多个方面中提供前述的解决方案。
在第一方面中,本披露公开了一种用于矩阵乘操作的集成电路装置,包括:接口单元,其配置成从外部存储器获取用于所述矩阵乘操作的矩阵数据,其中所述矩阵数据包括第一矩阵和第二矩阵,其中第一矩阵和第二矩阵被分别划分成N²个第一矩阵块和N²个第二矩阵块,并且所述第一矩阵和第二矩阵的矩阵乘操作包括基于N²个第一矩阵块和N²个第二矩阵块的N²个矩阵乘任务,其中N是大于或等于2的正整数;N²个主计算单元,该N²个主计算单元依次连接以形成数据传递的回路,其中每个主计算单元配置成执行N²个矩阵乘任务中的相应一个,并且包括:多个存储区,其配置成存储用于执行矩阵乘任务的矩阵块和中间结果;以及控制单元,其配置成与相邻的主计算单元进行矩阵块交换。
在执行上述相应一个所述矩阵乘任务中,每个所述主计算单元配置成:通过所述接口单元获取与其矩阵乘任务关联的一个第一矩阵块和一个第二矩阵块,并且分别存储于第一存储区和第二存储区中;对所述一个第一矩阵块和一个第二矩阵块执行矩阵乘操作,以得到一个中间结果;通过所述控制单元并且利用所述第一存储区和第二存储区来与相邻的主计算单元执行N-1次矩阵块交换,并且对每次交换到的第一矩阵块和第二矩阵块执行矩阵乘操作,以分别得到N-1个中间结果;以及对N个中间结果执行求和操作,以完成与其关联的矩阵乘任务。
在第二方面中,本披露公开了一种板卡,包括前述和稍后将在多个实施例中描述的集成电路装置。
在第三方面中,本披露公开了一种计算设备,包括前述和稍后将在多个实施例中描述的板卡。
在第四方面中,本披露公开了一种计算系统,包括前述和稍后将在多个实施例中描述的计算设备。
在第五方面中,本披露公开了一种使用前述和稍后将在多个实施例中描述的集成电路装置来执行矩阵乘操作的方法,包括:使用集成电路装置的接口单元从外部存储器获取用于所述矩阵乘操作的矩阵数据,其中所述矩阵数据包括第一矩阵和第二矩阵,其中第一矩阵和第二矩阵被分别划分成N²个第一矩阵块和N²个第二矩阵块,并且所述第一矩阵和第二矩阵的矩阵乘操作包括基于N²个第一矩阵块和N²个第二矩阵块的N²个矩阵乘任务,其中N是大于或等于2的正整数;以及使用每个所述主计算单元来执行以下操作:通过所述接口单元获取与其矩阵乘任务关联的一个第一矩阵块和一个第二矩阵块,并且分别存储于第一存储区和第二存储区中;对所述一个第一矩阵块和一个第二矩阵块执行矩阵乘操作,以得到一个中间结果;通过所述控制单元并且利用所述第一存储区和第二存储区来与相邻的主计算单元执行N-1次矩阵块交换,并且对每次交换到的第一矩阵块和第二矩阵块执行矩阵乘操作,以分别得到N-1个中间结果;以及对N个中间结果执行求和操作,以完成与其关联的矩阵乘任务。
在第六方面中,本披露提供一种计算机程序产品,其包括用于执行矩阵乘操作的程序指令,当所述程序指令由一个或多个处理器来执行时,使得实现上述和稍后将在多个实施例中描述的方法。
通过利用本披露上述的集成电路装置、计算设备、计算系统、板卡和方法,可以充分利用片上系统的片内资源,在主计算单元之间实现数据的共享和传递,由此可以显著减少与外部存储器之间的I/O数据交互,从而实现数据传送和相乘操作的高效并行执行。进一步,通过结合硬件架构来对矩阵进行多级拆分,本披露的方案简化了矩阵乘操作的复杂度并且支持对超大矩阵的矩阵乘操作。另外,通过显著减少与外部存储器的数据交互,本披露的方案还提升了矩阵乘运算的执行效率,降低由于片上与片外I/O带宽限制造成的运算性能瓶颈问题,从而可以提升集成电路装置、计算设备、计算系统或板卡的整体性能。
附图说明
通过参考附图阅读下文的详细描述,本披露示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本披露的若干实施方式,并且相同或对应的标号表示相同或对应的部分,其中:
图1是示出根据本披露实施例的集成电路装置的示意架构图;
图2是示出根据本披露实施例的单个主计算单元的结构示意图;
图3是示出根据本披露实施例的“2*2”主计算单元的架构图;
图4a和图4b是示出根据本披露实施例的“2*2”主计算单元用于卷积矩阵乘操作的框图;
图5a和图5b是示出根据本披露实施例的“2*2”计算子单元用于卷积矩阵乘操作的结构框图;
图6是示出根据本披露实施例的集成电路装置所执行的流水操作示意图;
图7是示出根据本披露实施例的“3*3”主计算单元的结构架构图;
图8是示出根据本披露实施例的用于矩阵乘操作的板卡;
图9是示出根据本披露实施例的用于矩阵乘操作的计算系统;
图10是示出根据本披露实施例的用于执行矩阵乘操作的方法的流程图;
图11是示出根据本披露实施例的一种组合处理装置的结构图;以及
图12是示出根据本披露实施例的一种板卡的结构示意图。
具体实施方式
下面将结合本披露实施例中的附图,对本披露实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本公开一部分实施例,而不是全部的实施例。基于本披露中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本披露公开的方案保护的范围。
下面结合附图来详细描述本披露的具体实施方式。
图1是示出根据本披露实施例的用于矩阵乘操作的集成电路装置102的示意架构图。为了便于理解本披露的方案,图中还示出与集成电路装置102进行信息交互的外部存储器104。在一个实施场景中,该外部存储器可以是动态随机存储器(Dynamic Random Access Memory,“DRAM”),并且与本披露的矩阵乘操作相关的矩阵数据可以存储于该动态随机存储器。如本领域技术人员可以理解,矩阵乘操作可以涉及第一矩阵和第二矩阵,并且第一矩阵和第二矩阵可以被划分成N²个第一矩阵块和N²个第二矩阵块,这里N是大于或等于2的正整数。例如N=2时,可以将第一矩阵和第二矩阵划分成4个矩阵块,例如对于“4*4”的第一矩阵或第二矩阵,可以将其划分成4个“2*2”的第一矩阵块和第二矩阵块。再例如N=3时,可以将第一矩阵和第二矩阵划分成9个矩阵块,例如对于“6*6”的第一矩阵或第二矩阵,可以将其划分成9个“2*2”的第一矩阵块和第二矩阵块。通过前述的分块处理,本披露的方案可以将大的矩阵乘操作划分成N²个矩阵乘任务,以便由下面将要详细描述的本披露的主计算单元来执行。
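作为对上述分块方式的补充说明,下面给出一段示意性的纯Python代码(其中partition等名称及示例数值仅为本文说明所假设,并非本披露方案的实际接口),演示如何将一个方阵划分成N²个矩阵块:

```python
def partition(mat, N):
    """将方阵(嵌套列表表示)划分成 N*N 个矩阵块。"""
    n = len(mat)
    assert n % N == 0, "矩阵边长必须能被 N 整除"
    b = n // N  # 每个矩阵块的边长
    return [[[row[j * b:(j + 1) * b] for row in mat[i * b:(i + 1) * b]]
             for j in range(N)]
            for i in range(N)]

# 例如 N=2 时,一个 4*4 矩阵被划分成 4 个 2*2 的矩阵块
A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
blocks = partition(A, 2)  # blocks[0][0] 即左上角的 2*2 块 [[1, 2], [5, 6]]
```

对N=3的情形,调用partition(mat, 3)即可得到9个矩阵块,与正文所述一致。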
进一步如图1中所示,本披露提供的集成电路装置102可以包括接口单元106和N²个主计算单元108。在一个应用场景中,可以使用直接存储器访问(Direct Memory Access,“DMA”)接口作为前述的接口单元,以便将外部存储器的矩阵数据发送至多个主计算单元108,例如图中示例性示出的五个主计算单元和中间以黑点省略示出的一个或多个主计算单元。可以看出,本披露的N²个主计算单元可以构成一个“N*N”的计算阵列,以便并行地执行矩阵乘操作。在一个实施例中,本披露的N²个主计算单元依次连接以形成数据传递的回路,从而可以在连续的回路上向其他的主计算单元传递包括上述的第一矩阵块或第二矩阵块中的部分行块和列块的数据,由此可以执行上文提到的N²个矩阵乘任务中的相应一个。下面将结合图2对本披露的主计算单元进行详细的描述。
如图2中所示,本披露的主计算单元可以包括M²个计算子单元,即“M*M”的计算阵列,其中M是大于或等于2的正整数。根据不同的实施场景,M可以等于或不等于前述的N,例如N=2并且M=2,或N=2并且M=3。进一步,主计算单元可以包括多个存储区,例如图中所示出的共享存储区和与每个计算子单元所关联的私有存储区。在一个实施例中,共享存储区可以是与私有存储区不同的存储区。在另一个实施例中,私有存储区可以是共享存储区中专门划分出以用于计算子单元临时存储的存储空间。在一个实现场景中,主计算单元中的多个存储区可以配置成存储用于执行矩阵乘任务的矩阵块和中间结果。
为了实现与形成数据传递回路的相邻主计算单元的数据交互,本披露的主计算单元还包括控制单元,其配置用于与相邻的主计算单元进行矩阵块交换。由此,借助于集成电路装置和外部存储器之间的接口单元以及各个主计算单元的控制单元,本披露的方案通过令集成电路装置中的多个主计算单元从外部存储器获取各自矩阵乘任务的一部分矩阵块数据,并且通过数据交互从相邻连接的一个或多个主计算单元来获得另一部分(或另外多个部分)矩阵块数据,从而可以获取完成相应矩阵乘任务所需要的矩阵块数据并且基于此完成相应的矩阵乘任务。
具体来说,在执行相应一个上述的矩阵乘任务中,每个主计算单元可以配置成通过接口单元获取与其矩阵乘任务相关的一个第一矩阵块(其来自于第一矩阵)和一个第二矩阵块(其来自于第二矩阵),并且将其分别存储于第一存储区和第二存储区。此处,第一存储区和第二存储区可以是从共享存储区中分配的两个独立的存储空间,以用作存储中间数据的缓冲区。
在获得上述的一个第一矩阵块和一个第二矩阵块后,本披露的主计算单元可以对该第一矩阵块和第二矩阵块来执行矩阵乘操作,从而获得一个中间结果。如前所述,这里的第一矩阵块和第二矩阵块的矩阵乘操作可以由主计算单元中的M²个计算子单元来并行流水执行。此后,主计算单元可以通过控制单元并且利用第一存储区和第二存储区来与相邻的主计算单元执行N-1次的矩阵块交换,并且对每次交换到的第一矩阵块和第二矩阵块执行矩阵乘操作,从而得到N-1个中间结果。例如,当N=2时,即4个主计算单元串行连接,则一个主计算单元可以从相邻连接的两个主计算单元来获得另一个第一矩阵块和第二矩阵块,从而再获得一个中间结果。在获得N个中间结果后,本披露的主计算单元可以将这些中间结果进行求和,从而完成与其关联的一个矩阵乘任务。
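上述“本地相乘、N-1次相邻交换、再求和”的流程正是加农算法的块轮转过程。作为示意,下面用纯Python给出一个最小化模型(cannon_matmul为本文示例所设的名称,其中每个主计算单元仅持有一个标量“矩阵块”,仅用于演示算法逻辑,并非本披露硬件的实际实现):

```python
def cannon_matmul(A, B):
    """加农算法的最小示意:N*N 个计算单元,每个单元持有一个标量“块”。
    初始对齐(skew)之后,每一步先做本地相乘并累加,
    然后 A 向左轮转、B 向上轮转,对应正文所述的 N-1 次相邻交换。"""
    N = len(A)
    # 初始对齐:A 的第 i 行左移 i 个位置,B 的第 j 列上移 j 个位置
    a = [[A[i][(j + i) % N] for j in range(N)] for i in range(N)]
    b = [[B[(i + j) % N][j] for j in range(N)] for i in range(N)]
    C = [[0] * N for _ in range(N)]
    for _ in range(N):  # 1 次本地相乘 + (N-1) 次交换后的相乘
        for i in range(N):
            for j in range(N):
                C[i][j] += a[i][j] * b[i][j]
        # 与相邻单元交换:A 左移一格,B 上移一格
        a = [[a[i][(j + 1) % N] for j in range(N)] for i in range(N)]
        b = [[b[(i + 1) % N][j] for j in range(N)] for i in range(N)]
    return C
```

当N=2时,上述循环体共执行2次,即正文所述的“1次本地相乘加1次交换后相乘”,两次乘积累加即得到完整结果。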
如上所提到的,本披露的主计算单元利用其M²个计算子单元来执行具体的矩阵乘任务。基于这样的布置,本披露的矩阵乘操作还可以涉及第一矩阵块和第二矩阵块被再次划分的情形。具体来说,第一矩阵块和第二矩阵块还可以被分别划分成M²个第一矩阵子块和M²个第二矩阵子块。基于此,前述一个主计算单元的一个矩阵乘任务可以包括基于M²个第一矩阵子块和M²个第二矩阵子块的M²个矩阵乘子任务。进一步,M²个计算子单元的每个可以配置成执行M²个矩阵乘子任务中的对应一个。
具体而言,在执行对应一个矩阵乘子任务中,每个计算子单元可以配置成执行M次的矩阵乘操作,从而获得M个中间子结果。特别地,计算子单元可以从共享存储区(例如第一存储区和第二存储区)分别获取与其矩阵乘子任务关联的一个第一矩阵子块和一个第二矩阵子块。接着,计算子单元可以对一个第一矩阵子块和对应的一个第二矩阵子块执行矩阵乘操作,以得到一个中间子结果。最后,通过对M个中间子结果执行求和操作,以完成与其关联的矩阵乘子任务。
基于上文所描述的本披露集成电路装置的内部架构和矩阵划分,本披露的方案也实现了高度的并行操作。特别地,N²个主计算单元可以配置成并行地执行与各自关联的矩阵乘任务,并且所述M²个计算子单元可以配置成并行地执行与各自关联的矩阵乘子任务。另外,就本披露的矩阵划分而言,该划分方式可以根据加农(“Cannon”)算法规则来划分。例如,本披露参与矩阵乘操作的第一矩阵和第二矩阵可以根据加农算法规则在主计算单元这一层级处划分成N²个第一矩阵块和N²个第二矩阵块。接着,在计算子单元这一层级处,可以将一个第一矩阵块和一个第二矩阵块根据加农算法规则来进一步划分,从而得到M²个第一矩阵子块和M²个第二矩阵子块。
通过上述结合图1和图2的描述,本领域技术人员可以理解本披露通过对大(或超大)矩阵间的矩阵乘操作进行多次(或者说多轮)的分块处理,并且通过对应的主计算单元和计算子单元来执行,从而实现了矩阵乘操作的并行流水操作。由此,本披露的方案在矩阵乘操作方面实现了简化矩阵乘运算的复杂度、加速矩阵乘运算的显著优势。进一步,通过从外部存储器获取全部的矩阵数据,并且在主计算单元通过控制单元来进行相互的交换,从而避免了与外部存储器之间频繁的数据交互,突破了现有的I/O交互的瓶颈。进一步,本披露的主计算单元和计算子单元的数目可以根据计算场景来灵活设置,并且可以通过级联的方式来实现对任意规模的矩阵乘操作,从而架构布置灵活并且支持各类的矩阵乘运算场景。
图3是示出根据本披露实施例的2²(即4)个主计算单元的架构图。如图3中所示,该4个主计算单元(包括主计算单元0~主计算单元3)通过控制单元互联,以形成一个“2*2”的计算阵列。如前结合图1和图2所描述的,该4个主计算单元可以配置用于执行4个第一矩阵块和4个第二矩阵块之间的矩阵乘操作,并且每个主计算单元可以执行4个矩阵乘任务中的一个矩阵乘任务。进一步,图3中还示出了每个主计算单元所包括的M²个计算子单元。通过将一个矩阵乘任务分配给M²个计算子单元来执行,可以实现并行的流水操作,从而加速矩阵乘操作并满足各种应用场景的需求。
在一个应用场景中,本披露的集成电路装置可以应用在人工智能领域,特别是包括深度神经网络的机器学习中。例如,本披露的集成电路装置可以对接收到的第一矩阵和第二矩阵执行神经网络中所涉及的卷积运算,其中涉及大量的矩阵乘操作。为了更好地理解本披露的集成电路装置如何应用于这样的应用场景,下面将结合图4a和图4b来示例性描述本披露的集成电路装置根据加农算法来执行卷积运算中所涉及的矩阵乘操作。
图4a示出了根据本披露实施例的一种集成电路装置结构的示意图,其包括互联的4个(即“2*2”个)主计算单元,即主计算单元0~主计算单元3。另外,为了图示的简化,图中并没有示出主计算单元中所包括的多个计算子单元。进一步,图4b示意性地示出待运算的两个输入矩阵及其计算结果的矩阵块。具体地,两个待执行矩阵乘操作的矩阵分别是包括卷积结果梯度的第一矩阵和包括卷积输入的第二矩阵。进一步,二者执行矩阵乘操作后所得到的结果矩阵,即卷积权重梯度。
如图4a所示,四个主计算单元(每个即为图1中的主计算单元108)以顺时针方向顺序编号为主计算单元0、主计算单元1、主计算单元2和主计算单元3,并且已经依次连接以形成闭合的回路(或称环路)。具体地,相邻的主计算单元0与1之间具有双向的通信连接,例如主计算单元间可以通过DMA进行双向通信。类似地,相邻的主计算单元1和2之间、主计算单元2和3之间、以及主计算单元3和0之间也分别具有两条双向的通信连接,以便在控制单元的控制下进行矩阵块的相互传递。另外,每个主计算单元也可以经由接口单元分别与外部存储器(图中虚框示出)进行通信连接,以获得各自执行计算任务所需的矩阵块数据(本例中为卷积结果梯度和卷积输入)。
如本领域技术人员所知,作为本例矩阵乘结果的卷积权重梯度可以用于在神经网络反向传播过程中对前向传播中的卷积结果的梯度进行更新。在一个运算场景中,卷积权重梯度计算相当于作为本例中的第一矩阵的卷积结果梯度(当为四维矩阵时,其维度可以表示为如图中所示的NiHiWiCi)与作为本例中的第二矩阵的卷积输入(当为四维矩阵时,其维度表示为如图中所示的NoHoWoCo)的乘积累加计算。这里,N表示样本数,H表示矩阵高度,W表示矩阵宽度,C表示通道数。进一步,根据矩阵乘的运算规则,输入矩阵“卷积结果梯度”可以表示为Ci*NiHiWi,而输入矩阵“卷积输入”可以表示为NoHoWo*Co,二者在NiHiWi和NoHoWo方向上做卷积权重梯度计算(例如乘加运算),最终获得的输出矩阵“卷积权重梯度”可以表示为Kh*Kw*Ci*Co(其中,Kh表示输出矩阵的高度,Kw表示输出矩阵的宽度,Ci表示输入矩阵“卷积结果梯度”的通道数,Co表示输入矩阵“卷积输入”的通道数)。为了简明的目的,图中仅示出Ci*Co方向上的卷积权重梯度计算,也即本披露的矩阵乘操作。
基于上述的示例性数据摆放规则(包括例如根据加农算法的矩阵划分方式)和闭合成环的四个主计算单元的架构,可以将存储在外部存储器中的第一矩阵“卷积结果梯度”和第二矩阵“卷积输入”分别划分成四个矩阵块。为了简化的目的,将第一矩阵“卷积结果梯度”划分的四个矩阵块表示为如图4b所示的A00、A01、A10和A11。类似地,将第二矩阵“卷积输入”划分的四个矩阵块表示为B00、B01、B10和B11。相应地,作为结果矩阵的输出矩阵“卷积权重梯度”也可以划分成四个矩阵块C00、C01、C10和C11。
基于上述的数据分块,各个主计算单元可以分别执行下面的算式(1)至(4),以便计算获得各自对应的卷积权重梯度C00、C01、C11和C10:
C00=A00*B00+A01*B10    (1)
C01=A00*B01+A01*B11    (2)
C11=A10*B01+A11*B11    (3)
C10=A10*B00+A11*B10    (4)
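上述式(1)至(4)即标准的2*2分块矩阵乘法恒等式。下面给出一段示意性的纯Python验证代码(其中matmul、matadd、assemble等辅助函数名及示例数值均为本文说明所假设,并非本披露方案的实际实现),以式(1)为例数值验证分块结果与完整乘积的一致性:

```python
def matmul(X, Y):
    """朴素矩阵乘法,输入为嵌套列表。"""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def matadd(X, Y):
    """逐元素矩阵加法。"""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def assemble(TL, TR, BL, BR):
    """把四个矩阵块拼回完整矩阵。"""
    top = [l + r for l, r in zip(TL, TR)]
    bottom = [l + r for l, r in zip(BL, BR)]
    return top + bottom

# 以 2*2 矩阵充当矩阵块(即一个 4*4 矩阵按图4b方式划分的情形)
A00, A01 = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
A10, A11 = [[9, 1], [2, 3]], [[4, 5], [6, 7]]
B00, B01 = [[1, 0], [0, 1]], [[2, 2], [2, 2]]
B10, B11 = [[1, 1], [1, 1]], [[0, 1], [1, 0]]

# 式(1):C00 = A00*B00 + A01*B10
C00 = matadd(matmul(A00, B00), matmul(A01, B10))

# C00 应等于完整乘积 A*B 的左上角矩阵块
A = assemble(A00, A01, A10, A11)
B = assemble(B00, B01, B10, B11)
```

式(2)至(4)可以按同样方式逐一验证,这里不再重复。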
具体来说,本披露的方案可以将图4a示出的四个主计算单元0、1、2和3分别用于执行对应上述式(1)至(4)中的计算任务,以分别获得C00、C01、C11和C10。在利用加农算法执行上述矩阵块相乘的运算场景中,可以根据加农算法的规则将如图4b示出的输入矩阵“卷积结果梯度”的A10和A11进行位置交换,并且将输入矩阵“卷积输入”的B01和B11进行位置交换,如图4b中箭头所示。
如前文所述,每个主计算单元可以从外部存储器接收其相应的一个第一矩阵块和一个第二矩阵块,并且执行对应的矩阵乘计算。例如,主计算单元0可以经由接口单元接收来自于外部存储器的第一矩阵“卷积结果梯度”和第二矩阵“卷积输入”的一个第一矩阵块“A00”和一个第二矩阵块“B00”,并且根据式(1)执行作为其矩阵乘任务一部分的第一矩阵乘子任务(A00*B00),这里“*”表示矩阵乘操作。类似地,主计算单元1经由接口单元接收其对应的一个第一矩阵块和一个第二矩阵块,即(A01和B11),并且根据式(2)执行其第一矩阵乘任务(A01*B11)。同样地,主计算单元2和3经由数据接口分别接收各自的一个第一矩阵块和一个第二矩阵块(A10和B01)和(A11和B10),并且根据式(3)和(4)分别执行各自的第一矩阵乘任务(A10*B01)和(A11*B10)。
在每个主计算单元完成从外部存储器接收矩阵块的数据并执行矩阵乘任务期间,其还可以接收来自于互联的主计算单元的另一第一矩阵块和另一第二矩阵块。如前所述,本披露的每个主计算单元可以利用双向的通信连接分别向相邻的主计算单元发送其从外部存储器接收到的部分矩阵块数据,以作为所述相邻的主计算单元的另一(或第二)矩阵乘任务的相应矩阵块数据。
如前所述,获得“C00”可以视为主计算单元0的矩阵乘任务,并且根据式(1)可知,完成“C00”矩阵乘任务中的第二矩阵乘任务所需的另一第一矩阵块和第二矩阵块分别为“A01”和“B10”。进一步,从图4a中可以看出,与主计算单元0相邻的主计算单元1可以将其先前从外部存储器接收到的第一矩阵块“A01”发送给主计算单元0。对应地,与主计算单元0相邻的主计算单元3可以将其先前从外部存储器接收到的第一矩阵块B10发送给主计算单元0。由此,主计算单元0可以通过对接收到的矩阵块数据“A01”和“B10”执行矩阵乘操作来完成其第二矩阵乘任务。类似地,主计算单元1、2与3也可以利用双向的通信连接接收相邻主计算单元发送的矩阵块数据,即对应的一个第一矩阵块和一个第二矩阵块,如图中所示的(“A00”和“B01”)、(“A11”和“B11”)以及(“A10”和“B00”)。接着,每个主计算单元可以根据式(1)至(4)执行各自的第二矩阵乘任务,并且通过将第一和第二矩阵乘任务的中间结果进行求和来获得每个主计算单元各自关联的矩阵乘结果,即本例中的卷积权重梯度C00、C01、C11和C10,从而完成各自的矩阵乘任务。
通过上文结合图4a和图4b的描述可以看出,本披露的每个主计算单元仅需从外部存储器接收部分矩阵块数据,而另一部分矩阵块数据的接收则更好地利用了主计算单元之间的高速通信总线。由此,本披露的方案显著减少了主计算单元与外部存储器的数据交互,从而明显降低了片上与片外I/O的数据传输量并克服了由于带宽限制引起的I/O瓶颈。需要注意的是,图4a中所示出的四个主计算单元形成闭合环路仅仅是示例性的而非限制性的。本领域技术人员根据具体的应用场景,也可以预先布置其他合适数目的主计算单元以形成处理阵形和数据传递环路,例如图7中所示出的(稍后详细描述)。
如前所述,本披露的矩阵乘操作可以由每个主计算单元内的多个计算子单元来执行具体的矩阵乘操作。基于这样的多个计算子单元设置,本披露的第一矩阵块和第二矩阵块可以被进一步划分为多个第一矩阵子块和第二矩阵子块,并且由此每个矩阵乘任务(例如上文的式(1)、(2)、(3)或(4))可以划分为与所述多个计算子单元中的每个相对应的多个矩阵乘子任务。基于此,每个计算子单元可以根据与其关联的矩阵乘子任务,从所述共享存储区读取对应的一个第一矩阵子块和一个第二矩阵子块来执行矩阵运算。为了更好地理解,下文将结合图5a和图5b讨论每个计算子单元如何根据加农算法的规则来完成各自对应的矩阵乘子任务。
图5a和图5b是示出根据本披露实施例的“2*2”计算子单元用于卷积矩阵乘操作的结构框图。为了方便描述和理解,下文仅结合图5a和图5b来描述前文的卷积权重梯度计算中涉及主计算单元0根据加农算法来执行其第一矩阵乘任务“A00*B00”。
如图5a所示,主计算单元0包括共享存储区和顺序编号为0、1、2和3的四个计算子单元(每个即图2中的计算子单元)。在矩阵乘运算期间,每个计算子单元可以从共享存储区接收(或称加载)各自的第一矩阵子块和第二矩阵子块的矩阵数据。具体地,根据各自关联的矩阵乘子任务,图5a中的每个计算子单元从共享存储区接收各自的一个第一矩阵子块和一个第二矩阵子块,并且执行相应的运算以获得一个中间子结果。重复前述步骤,每个计算子单元可以获得另一个中间子结果。最后,通过将前述两个中间子结果进行求和,从而获得针对于其矩阵乘子任务的中间结果。
如图5b中所示,利用前述存储于共享存储区中的第一矩阵块“卷积结果梯度”A00(例如为四维矩阵,表示为Ci*NiHiWi)和第二矩阵块“卷积输入”B00(例如为四维矩阵,表示为NoHoWo*Co)作为两个输入数据,来执行主计算单元0的第一矩阵乘任务“卷积权重梯度”(A00*B00)(图中为了简化的目的而仅示出了Ci*Co方向)。为此,根据加农算法可以将A00划分成四个第一矩阵子块a00、a01、a10和a11,并且将B00划分成四个第二矩阵子块b00、b01、b10和b11,并且该八个矩阵子块存储于共享存储区中。进一步,根据加农算法,输出矩阵(A00*B00)的结果C00也可以划分成四个子块c00、c01、c10和c11。基于此,根据加农算法中矩阵乘的运算规则,可以通过下式(5)至(8)获得c00、c01、c11和c10:
c00=a00*b00+a01*b10     (5)
c01=a00*b01+a01*b11     (6)
c11=a10*b01+a11*b11      (7)
c10=a10*b00+a11*b10      (8)
根据本披露的方案,可以令图5a所示出的四个计算子单元0、1、2和3分别执行上式(5)至(8)中的计算,即分别执行各自的矩阵乘子任务以得到对应的c00、c01、c11和c10。以获得c00的矩阵乘子任务为例,执行该矩阵乘子任务的计算子单元0的矩阵子块包括a00、b00、a01和b10。同样地,对于获得c11的矩阵乘子任务来说,执行该子任务的计算子单元2的矩阵子块为a10、b01、a11和b11。
与前文结合图4b所做描述类似,在利用加农算法进行计算时,可以将如图5b左侧所示的“卷积结果梯度”A00的a10和a11进行位置交换,并且将“卷积输入”B00的b01和b11进行位置交换。由此,执行获得c01的矩阵乘子任务的计算子单元1的第一和第二矩阵子块为a00、b01、a01和b11,而执行获得c10的矩阵乘子任务的计算子单元3的第一和第二矩阵子块为a10、b00、a11和b10。
如图5a上图所示,四个计算子单元中的每个都可以从共享存储区接收各自的第一和第二矩阵子块。以计算子单元0为例,其可以从共享存储区加载(a00和b00)以执行(a00*b00)的矩阵乘计算。接着,如图5a下图所示,计算子单元0可以从共享存储区接着加载(a01和b10)部分以执行(a01*b10)的矩阵乘计算。最后,通过将(a00*b00)和(a01*b10)的计算结果相加,计算子单元0完成了与其关联的矩阵乘子任务。对于计算子单元1、2和3,其也执行类似于计算子单元0的操作,从而完成各自的矩阵乘子任务。
基于上文的描述,本领域技术人员将理解到,主计算单元0的第一矩阵乘任务(如A00*B00)中的每个矩阵乘子任务获得的计算结果只是中间子结果。因此,还需要进一步完成第二矩阵乘任务(如A01*B10)所对应的多个矩阵乘子任务来获得另一中间结果,从而可以通过对两个中间结果求和来获得如图5b示出的主计算单元0关联的矩阵乘任务C00的最终计算结果。具体来说,计算子单元0例如可以根据式(5)执行第一矩阵乘任务(A00*B00)对应的矩阵乘子任务,并且将获得的c00作为第一子结果c00₁。接着,利用计算子单元0执行C00的第二矩阵乘任务(A01*B10)中对应的矩阵乘子任务,以获得第二子结果c00₂。最后,可以将两个子结果c00₁和c00₂进行求和操作,从而获得输出矩阵块C00中的矩阵块c00。考虑到算式(5)右侧包括两部分的加法操作并且由此c00₂由两个中间结果相加获得,因此c00₁也可以与c00₂的第一中间结果和第二中间结果顺序相加,从而获得矩阵子块c00,具体操作将稍后参考图6中第6个和第7个时间片的计算操作列来描述。
执行与计算子单元0相类似的操作,计算子单元1、2和3也可以分别获得C00中的矩阵子块c01、c11和c10,从而如图5b右侧所示出的四个矩阵子块c00、c01、c11和c10构成主计算单元0执行矩阵乘任务所获得的输出矩阵块C00。每个计算子单元的中间计算结果(例如c00、c01、c11和c10)也可以存储于对应的主计算单元的共享存储区中而无需存储于外部存储器。由此,本披露的方案可以减少与外部存储器之间的数据交换,从而降低了由于外部带宽限制而导致的I/O瓶颈。
进一步,根据上文的描述,本领域技术人员可以理解图5a中示出的主计算单元包括四个计算子单元仅仅是示例性的而非限制性的。根据不同的应用场景,本领域技术人员基于本披露的教导可以预先设置不同数目的计算子单元,或者启用或禁用不同数目的计算子单元,以便来执行例如加农算法的矩阵乘计算。
图6是示出根据本披露实施例的集成电路装置(包括主计算单元及其计算子单元)所执行的流水操作示意图。特别地,图6以图5a和图5b示出的主计算单元0及其计算子单元0执行卷积运算为例,按时间顺序示出主计算单元0、计算子单元0、外部存储器和共享存储区之间的数据传输和具体的操作(包括例如数据加载和矩阵乘操作)。
具体地,图6以行的形式示出从第1个时间片起到第8个时间片结束期间,主计算单元0及其计算子单元0在各个时间片内执行相应的数据接收、发送、加载或矩阵乘运算,从而最终获得关于卷积权重梯度的输出矩阵块C00中的矩阵子块c00的流水操作。进一步,以列的形式示出在每个时间片内执行的四类操作。如图中所示,第1列表示从外部存储器(例如经由DDR)加载数据的操作,例如从其接收本披露所讨论的第一矩阵块和第二矩阵块;第2列表示主计算单元间的数据传递,例如主计算单元0的共享存储区向相邻的主计算单元1和3发送其第一和第二矩阵块,并且从主计算单元1和3接收它们的第一和第二矩阵块以作为主计算单元0执行第二矩阵乘任务的操作数据;第3列表示计算子单元0的数据加载;第4列表示计算子单元0内执行的矩阵乘操作。根据前述的时间片和操作划分,主计算单元0在相应的时间片执行对应的操作。例如,在第1时间片内,主计算单元0的共享存储区仅执行存储从外部存储器(即“片外”)接收到的B00的操作。又例如,在第2时间片内,主计算单元的共享存储区执行从外部存储器接收A00而计算子单元0执行从共享存储区加载B00中的b00的操作。
为了高效地利用片上的I/O和计算资源,本披露的片上操作可以是乒乓流水操作。具体来说,根据本披露的方案,可以对片上存储资源进行乒(“ping”)和乓(“pong”)两部分的划分。在一个实施例中,当ping存储资源在用于加载数据时,pong存储资源用于进行矩阵乘计算;相反,当ping存储资源在用于矩阵乘计算时,pong存储资源用于加载数据。基于这样的资源分配,本披露的主计算单元可以执行并行的乒乓流水操作。
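上述乒乓流水的调度次序可以用如下纯Python的串行模型来示意(pingpong_pipeline等名称为本文示例所假设;真实硬件中加载与计算是并行发生的,此处仅以两个缓冲区的交替使用来模拟其调度次序):

```python
def pingpong_pipeline(tiles, load, compute):
    """双缓冲(乒乓)调度的串行示意:
    第 k 步在一个缓冲区上计算时,预先把第 k+1 步的数据
    装入另一个缓冲区;两个缓冲区逐步交替充当乒/乓角色。"""
    buffers = [None, None]        # buffers[0] 为乒,buffers[1] 为乓
    results = []
    if not tiles:
        return results
    buffers[0] = load(tiles[0])   # 预先装入第一块数据
    for step in range(len(tiles)):
        cur = step % 2            # 本步用于计算的缓冲区
        nxt = (step + 1) % 2      # 本步用于加载的缓冲区
        if step + 1 < len(tiles):
            buffers[nxt] = load(tiles[step + 1])  # 真实硬件中与计算并行
        results.append(compute(buffers[cur]))
    return results

# 示例:load 把数据“搬上片”,compute 对其做一次运算
out = pingpong_pipeline([1, 2, 3], load=lambda t: t * 10, compute=lambda b: b + 1)
```

可以看到,除第一块数据的初始装载外,其余每一步的加载都与上一块数据的计算在时序上重叠,这正是乒乓流水掩盖I/O延迟的基本思路。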
从图中可以看出,在第1时间片中,主计算单元0从外部存储器加载B00,并且存储于共享存储区的乒部分。在第2时间片中,主计算单元0从外部存储器加载A00,并且存储于共享存储区的乒部分。同时,并行地可以加载B00的b00至计算子单元0。在第3时间片中,可以加载A00的a00至计算子单元0。另外,在第3和第4时间片期间,主计算单元0通过控制单元发送A00至互联的主计算单元1,发送B00至互联的主计算单元3。同时,经由控制单元接收来自于主计算单元1的A01和来自于主计算单元3的B10。
在第4时间片的数据加载列中,可以加载B00的b10和A00的a01至计算子单元0;同时,在该第4时间片的矩阵乘操作列中,计算A00和B00的a00*b00以获得中间子结果。在第5时间片的数据加载列中,可以加载B10的b00和A01的a00至计算子单元0;同时,在该第5时间片的计算操作列中,计算A00和B00的a01*b10以获得中间子结果,并且将该中间子结果与上一时间片的中间子结果进行累加,以获得第5时间片处的中间结果。在第6时间片的数据加载列中,可以加载B10的b10和A01的a01至计算子单元0;同时,在第6时间片的矩阵乘操作列中,计算A01和B10的a00*b00以获得中间子结果,并且将该中间子结果与上一时间片的中间子结果进行累加,以获得第6时间片的中间结果。在第7时间片的矩阵乘操作列中,计算A01和B10的a01*b10以获得中间子结果,并且将该中间子结果与上一时间片的中间子结果进行累加,以获得输出矩阵块C00的矩阵子块c00。
在上述第3~第7时间片内执行数据加载和计算的期间,前述片上存储资源的乓部分用于从外部存储器接收下一组B00(B00')和A00(A00')以用于主计算单元0执行其第一矩阵乘任务。接着,从第8个时间片开始,计算子单元0将上一时间片计算输出的C00的c00存储至共享存储区。同时加载下一组B00'的b00和A00'的a00至计算子单元0,以便在下一时间片进行计算(未示出)。
类似地,主计算单元0的计算子单元1、2和3,以及不同主计算单元及其计算子单元也执行上述8个时间片的类似操作,以获得各自输出矩阵的对应矩阵块。由于输入矩阵“卷积结果梯度”和“卷积输入”可以是多维结构,因此可以先对NHW三个方向上的计算结果进行累加。然后,在两个输入矩阵的Ci和Co维度上循环执行上述计算,以得到输出矩阵“卷积权重梯度”的计算结果。
图7是示出根据本披露实施例的“3*3”主计算单元的结构架构图。从图7所示内容可以看出,该“3*3”主计算单元通过形成计算阵列和数据传递的回路,可以执行图7上部所示出的矩阵乘操作。与前述“2*2”主计算单元的操作不同,“3*3”主计算单元需要在相邻的主计算单元之间进行2次数据传递,而非“2*2”主计算单元中的一次数据传递。换句话说,对于本披露的方案,“N*N”个主计算单元需要在相邻的主计算单元之间进行(N-1)次数据传递或交换。为了便于理解,图7下部示出了经过第一和第二轮数据传递后各个主计算单元所获得的第一矩阵块和第二矩阵块数据。以主计算单元5为例,在从外部存储器获得其第一矩阵块“A23”和第二矩阵块“B32”后,在第一轮的数据传递中,其从主计算单元6接收另一第一矩阵块“A21”和从主计算单元8接收第二矩阵块“B12”,以执行其对应的矩阵乘任务“A21*B12”。此后,在第二轮的数据传递中,其从主计算单元6接收另一第一矩阵块“A22”和从主计算单元8接收第二矩阵块“B22”,以执行其对应的矩阵乘任务“A22*B22”。可以看出,通过图7中所示出的架构和矩阵划分,该“3*3”主计算单元可以支持将大的两个矩阵分别划分成两个“3*3”的矩阵块来执行矩阵乘操作。
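上述以主计算单元5为例的块轮转次序,可由下述示意函数直接计算(cannon_schedule为本文示例所设的名称,采用0起始的网格坐标与块下标,与图7中1起始的块标号相差1;这里假设图7中的主计算单元5对应网格坐标(1, 1)):

```python
def cannon_schedule(N, i, j):
    """返回网格坐标 (i, j) 的主计算单元在加农算法下
    依次相乘的 (A块下标, B块下标) 序列:
    第 k 步(k=0 为初始装载)使用 A[i][(i+j+k) % N] 与 B[(i+j+k) % N][j]。"""
    return [((i, (i + j + k) % N), ((i + j + k) % N, j)) for k in range(N)]

# “3*3”网格中坐标为 (1, 1) 的单元:按图7的1起始标号,
# 依次计算 A23*B32(初始装载)、A21*B12(第一轮交换)、A22*B22(第二轮交换)
schedule = cannon_schedule(3, 1, 1)
```

该序列与正文描述一致:初始装载之后还需N-1=2轮相邻交换;而对于“2*2”网格,序列长度为2,即只需1次交换。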
图8是示出根据本披露实施例的用于矩阵乘操作的板卡800。如图8所示,该板卡包括四个如前结合图1-图7所描述的集成电路装置。可以理解的是尽管这里示出四个,但本领域技术人员可以根据本披露的教导来布置互联的P²个集成电路装置,其中P是大于或等于2的正整数。利用包括该P²个集成电路装置的板卡,本披露的方案可以对分别划分成“P²*N²*M²”个矩阵块的第一矩阵和第二矩阵执行矩阵乘操作。
图9是示出根据本披露实施例的用于矩阵乘操作的计算系统900。如图9所示,该计算系统900包括四个服务器或主机,其中每个主机中布置有一个或多个图8中所示的板卡,以支持超大规模矩阵的矩阵乘操作。具体来说,当两个超大尺寸的矩阵进行相乘时,可以根据图9的计算系统来分别划分成四个矩阵块。接着,在每个主机上根据板卡的数目进一步划分每个矩阵块。以此类推,直到将参与矩阵乘计算的超大矩阵划分到本披露的计算子单元所支持的矩阵乘运算粒度。
图10是示出根据本披露实施例的用于执行矩阵乘操作的方法1000的流程图。结合上文的描述,可以理解的是方法1000可以由本披露的集成电路装置来执行,因此关于集成电路装置的描述也同样适用于下面针对方法1000的描述。
如图10中所示,在步骤1002处,方法1000使用集成电路装置的接口单元从外部存储器获取用于所述矩阵乘操作的矩阵数据。在一个实施例中,此处的矩阵数据包括第一矩阵和第二矩阵,其中第一矩阵和第二矩阵被分别划分成N²个第一矩阵块和N²个第二矩阵块,并且所述第一矩阵和第二矩阵的矩阵乘操作包括基于N²个第一矩阵块和N²个第二矩阵块的N²个矩阵乘任务,其中N是大于或等于2的正整数。接着,针对于每个主计算单元,方法1000执行步骤1004~1010来完成主计算单元的矩阵乘任务。
具体地,在步骤1004处,方法1000通过所述接口单元获取与其矩阵乘任务关联的一个第一矩阵块和一个第二矩阵块,并且分别存储于第一存储区和第二存储区中。接着,在步骤1006处,方法1000对所述一个第一矩阵块和一个第二矩阵块执行矩阵乘操作,以得到一个中间结果。随后,在步骤1008处,方法1000通过控制单元并且利用所述第一存储区和第二存储区来与相邻的主计算单元执行N-1次矩阵块交换,并且对每次交换到的第一矩阵块和第二矩阵块执行矩阵乘操作,以分别得到N-1个中间结果。最后,在步骤1010处,方法1000对N个中间结果执行求和操作,以完成与其关联的矩阵乘任务。
以上为了简明的目的,仅结合图10描述了本披露的方法。本领域技术人员根据本披露的公开内容也可以想到本披露的方法1000可以包括更多的步骤,并且这些步骤的执行可以实现前文结合图1-图9所描述的本披露的各类操作,此处不再赘述。
图11是示出根据本披露实施例的一种组合处理装置1100的结构图。如图11中所示,该组合处理装置1100包括计算处理装置1102、接口装置1104、其他处理装置1106和存储装置1108。根据不同的应用场景,计算处理装置中可以包括一个或多个集成电路装置1110,该集成电路装置可以配置用于执行本文结合附图1-图10所描述的矩阵乘操作。
在不同的实施例中,本披露的计算处理装置可以配置成执行用户指定的操作。在示例性的应用中,该计算处理装置可以实现为多核人工智能处理器。类似地,包括在计算处理装置内的一个或多个计算装置可以实现为人工智能处理器核或者人工智能处理器核的部分硬件结构。当多个计算装置实现为人工智能处理器核或人工智能处理器核的部分硬件结构时,就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。
在示例性的操作中,本披露的计算处理装置可以通过接口装置与其他处理装置进行交互,以共同完成用户指定的操作。根据实现方式的不同,本披露的其他处理装置可以包括中央处理器(Central Processing Unit,CPU)、图形处理器(Graphics Processing Unit,GPU)、人工智能处理器等通用和/或专用处理器中的一种或多种类型的处理器。这些处理器可以包括但不限于数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算处理装置和其他处理装置共同考虑时,二者可以视为形成异构多核结构。
在一个或多个实施例中,该其他处理装置可以作为本披露的计算处理装置(其可以具体化为人工智能例如神经网络运算的相关运算装置)与外部数据和控制的接口,执行包括但不限于数据搬运、对计算装置的开启和/或停止等基本控制。在另外的实施例中,其他处理装置也可以和该计算处理装置协作以共同完成运算任务。
在一个或多个实施例中,该接口装置可以用于在计算处理装置与其他处理装置间传输数据和控制指令。例如,该计算处理装置可以经由所述接口装置从其他处理装置中获取输入数据,写入该计算处理装置片上的存储装置(或称存储器)。进一步,该计算处理装置可以经由所述接口装置从其他处理装置中获取控制指令,写入计算处理装置片上的控制缓存中。替代地或可选地,接口装置也可以读取计算处理装置的存储装置中的数据并传输给其他处理装置。
附加地或可选地,本披露的组合处理装置还可以包括存储装置。如图中所示,该存储装置分别与所述计算处理装置和所述其他处理装置连接。在一个或多个实施例中,存储装置可以用于保存所述计算处理装置和/或所述其他处理装置的数据。例如,该数据可以是在计算处理装置或其他处理装置的内部或片上存储装置中无法全部保存的数据。
在一些实施例里,本披露还公开了一种芯片(例如图12中示出的芯片1202)。在一种实现中,该芯片是一种系统级芯片(System on Chip,SoC),并且集成有一个或多个如图11中所示的组合处理装置。该芯片可以通过对外接口装置(如图12中示出的对外接口装置1206)与其他相关部件相连接。该相关部件可以例如是摄像头、显示器、鼠标、键盘、网卡或wifi接口。在一些应用场景中,该芯片上可以集成有其他处理单元(例如视频编解码器)和/或接口模块(例如DRAM接口)等。在一些实施例中,本披露还公开了一种芯片封装结构,其包括了上述芯片。在一些实施例里,本披露还公开了一种板卡,其包括上述的芯片封装结构。下面将结合图12对该板卡进行详细地描述。
图12是示出根据本披露实施例的一种板卡1200的结构示意图,其中图8所示板卡可以视为板卡1200的一种具体化形式。如图12中所示,该板卡包括用于存储数据的存储器件1204,其包括一个或多个存储单元1210。该存储器件可以通过例如总线等方式与控制器件1208和上文所述的芯片1202进行连接和数据传输。进一步,该板卡还包括对外接口装置1206,其配置用于芯片(或芯片封装结构中的芯片)与外部设备1212(例如服务器或计算机等)之间的数据中继或转接功能。例如,待处理的数据可以由外部设备通过对外接口装置传递至芯片。又例如,所述芯片的计算结果可以经由所述对外接口装置传送回外部设备。根据不同的应用场景,所述对外接口装置可以具有不同的接口形式,例如其可以采用标准PCIE接口等。
在一个或多个实施例中,本披露板卡中的控制器件可以配置用于对所述芯片的状态进行调控。为此,在一个应用场景中,该控制器件可以包括单片机(Micro Controller Unit,MCU),以用于对所述芯片的工作状态进行调控。
根据上述结合图11和图12的描述,本领域技术人员可以理解本披露也公开了一种电子设备或装置,其可以包括一个或多个上述板卡、一个或多个上述芯片和/或一个或多个上述组合处理装置。
根据不同的应用场景,本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备 和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行划分,而实际实现时也可以有另外的划分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在一些实现场景中,上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时,所述集成的单元可以存储在计算机可读取存储器中。基于此,当本披露的方案以软件产品(例如计算机可读存储介质)的形式体现时,该软件产品可以存储在存储器中,其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如CPU、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。
基于上文对本披露的充分公开,本领域技术人员可以理解本披露也公开了如下条款所记载的技术方案:
条款A1、一种用于矩阵乘操作的集成电路装置,包括:
接口单元,其配置成从外部存储器获取用于所述矩阵乘操作的矩阵数据,其中所述矩阵数据包括第一矩阵和第二矩阵,其中第一矩阵和第二矩阵被分别划分成N²个第一矩阵块和N²个第二矩阵块,并且所述第一矩阵和第二矩阵的矩阵乘操作包括基于N²个第一矩阵块和N²个第二矩阵块的N²个矩阵乘任务,其中N是大于或等于2的正整数;
N²个主计算单元,该N²个主计算单元依次连接以形成数据传递的回路,其中每个主计算单元配置成执行N²个矩阵乘任务中的相应一个,并且包括:
多个存储区,其配置成存储用于执行矩阵乘任务的矩阵块和中间结果;以及
控制单元,其配置成与相邻的主计算单元进行矩阵块交换;
其中在执行相应一个所述矩阵乘任务中,每个所述主计算单元配置成:
通过所述接口单元获取与其矩阵乘任务关联的一个第一矩阵块和一个第二矩阵块,并且分别存储于第一存储区和第二存储区中;
对所述一个第一矩阵块和一个第二矩阵块执行矩阵乘操作,以得到一个中间结果;
通过所述控制单元并且利用所述第一存储区和第二存储区来与相邻的主计算单元执行N-1次矩阵块交换,并且对每次交换到的第一矩阵块和第二矩阵块执行矩阵乘操作,以分别得到N-1个中间结果;以及
对N个中间结果执行求和操作,以完成与其关联的矩阵乘任务。
条款A2、根据条款A1所述的集成电路装置,其中每个所述主计算单元包括M²个计算子单元,并且所述第一矩阵块和所述第二矩阵块被分别划分成M²个第一矩阵子块和M²个第二矩阵子块,并且一个所述矩阵乘任务包括基于M²个第一矩阵子块和M²个第二矩阵子块的M²个矩阵乘子任务,其中所述M²个计算子单元的每个配置成执行M²个矩阵乘子任务中的对应一个,并且在执行对应一个矩阵乘子任务中,所述计算子单元配置成:
执行M次如下操作,以获得M个中间子结果:
从所述第一存储区和所述第二存储区分别获取与其矩阵乘子任务关联的一个第一矩阵子块和一个第二矩阵子块;
对所述一个第一矩阵子块和对应的一个第二矩阵子块执行矩阵乘操作,以得到一个中间子结果;
对所述M个中间子结果执行求和操作,以完成与其关联的矩阵乘子任务。
条款A3、根据条款A2所述的集成电路装置,其中所述第一存储区和第二存储区是由所述M²个计算子单元所共享的共享存储区。
条款A4、根据条款A2所述的集成电路装置,其中每个所述主计算单元的多个存储区还包括M²个私有子存储区,并且每个私有子存储区与对应的一个计算子单元关联,并且配置成存储中间子结果。
条款A5、根据条款A2所述的集成电路装置,其中所述N²个主计算单元配置成并行地执行与各自关联的矩阵乘任务,并且所述M²个计算子单元配置成并行地执行与各自关联的矩阵乘子任务。
条款A6、根据条款A1-A5的任意一项所述的集成电路装置,其中根据加农算法规则来划分所述第一矩阵和第二矩阵,以得到N²个第一矩阵块和N²个第二矩阵块。
条款A7、根据条款A2-A5的任意一项所述的集成电路装置,其中根据加农算法规则来划分所述第一矩阵块和第二矩阵块,以得到M²个第一矩阵子块和M²个第二矩阵子块。
条款A8、一种板卡,包括一个或多个根据条款A1-A7的任意一项所述的集成电路装置。
条款A9、根据条款A8所述的板卡,其中当所述板卡包括P²个所述集成电路装置时,所述集成电路装置依次连接以形成数据传递的回路,以便对分别划分成P²*N²*M²个矩阵块的第一矩阵和第二矩阵执行矩阵乘操作,P是大于或等于2的正整数。
条款A10、一种计算设备,包括一个或多个根据条款A8所述的板卡。
条款A11、一种计算系统,包括多个根据条款A10所述的计算设备,其中多个计算设备互联并协同操作,以实现分布式的矩阵乘操作。
条款A12、一种使用根据条款A1-A7的任意一项所述的集成电路装置来执行矩阵乘操作的方法,包括:
使用集成电路装置的接口单元从外部存储器获取用于所述矩阵乘操作的矩阵数据,其中所述矩阵数据包括第一矩阵和第二矩阵,其中第一矩阵和第二矩阵被分别划分成N²个第一矩阵块和N²个第二矩阵块,并且所述第一矩阵和第二矩阵的矩阵乘操作包括基于N²个第一矩阵块和N²个第二矩阵块的N²个矩阵乘任务,其中N是大于或等于2的正整数;以及
使用每个所述主计算单元来执行以下操作:
通过所述接口单元获取与其矩阵乘任务关联的一个第一矩阵块和一个第二矩阵块,并且分别存储于第一存储区和第二存储区中;
对所述一个第一矩阵块和一个第二矩阵块执行矩阵乘操作,以得到一个中间结果;
通过所述控制单元并且利用所述第一存储区和第二存储区来与相邻的主计算单元执行N-1次矩阵块交换,并且对每次交换到的第一矩阵块和第二矩阵块执行矩阵乘操作,以分别得到N-1个中间结果;以及
对N个中间结果执行求和操作,以完成与其关联的矩阵乘任务。
条款A13、根据条款A12所述的方法,其中还使用所述计算子单元来执行以下操作:
执行M次如下操作,以获得M个中间子结果:
从所述第一存储区和所述第二存储区分别获取与其矩阵乘子任务关联的一个第一矩阵子块和一个第二矩阵子块;
对所述一个第一矩阵子块和对应的一个第二矩阵子块执行矩阵乘操作,以得到一个中间子结果;
对所述M个中间子结果执行求和操作,以完成与其关联的矩阵乘子任务。
条款A14、根据条款A13所述的方法,其中所述第一存储区和第二存储区是由所述M²个计算子单元所共享的共享存储区。
条款A15、根据条款A13所述的方法,其中每个所述主计算单元的多个存储区还包括M²个私有子存储区,并且每个私有子存储区与对应的一个计算子单元关联,并且配置成存储中间子结果。
条款A16、根据条款A13所述的方法,其中使用所述N²个主计算单元来并行地执行与各自关联的矩阵乘任务,并且使用所述M²个计算子单元来并行地执行与各自关联的矩阵乘子任务。
条款A17、根据条款A12-A16的任意一项所述的方法,其中包括使用加农算法规则来划分所述第一矩阵和第二矩阵,以得到N²个第一矩阵块和N²个第二矩阵块。
条款A18、根据条款A13-A16的任意一项所述的方法,其中根据加农算法规则来划分所述第一矩阵块和第二矩阵块,以得到M²个第一矩阵子块和M²个第二矩阵子块。
条款A19、一种计算机程序产品,其包括用于执行矩阵乘操作的程序指令,当所述程序指令由一个或多个处理器来执行时,使得实现根据条款A12-A18的任意一项所述的方法。
应当理解,本公开的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本公开的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本公开说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本公开。如在本公开说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本公开说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
虽然本文已经示出和描述了本披露的多个实施例,但对于本领域技术人员显而易见的是,这样的实施例只是以示例的方式来提供。本领域技术人员可以在不偏离本披露思想和精神的情况下想到许多更改、改变和替代的方式。应当理解的是在实践本披露的过程中,可以采用对本文所描述的本披露实施例的各种替代方案。所附权利要求书旨在限定本披露的保护范围,并因此覆盖这些权利要求范围内的等同或替代方案。

Claims (19)

  1. 一种用于矩阵乘操作的集成电路装置,包括:
    接口单元,其配置成从外部存储器获取用于所述矩阵乘操作的矩阵数据,其中所述矩阵数据包括第一矩阵和第二矩阵,其中第一矩阵和第二矩阵被分别划分成N²个第一矩阵块和N²个第二矩阵块,并且所述第一矩阵和第二矩阵的矩阵乘操作包括基于N²个第一矩阵块和N²个第二矩阵块的N²个矩阵乘任务,其中N是大于或等于2的正整数;
    N²个主计算单元,该N²个主计算单元依次连接以形成数据传递的回路,其中每个主计算单元配置成执行N²个矩阵乘任务中的相应一个,并且包括:
    多个存储区,其配置成存储用于执行矩阵乘任务的矩阵块和中间结果;以及
    控制单元,其配置成与相邻的主计算单元进行矩阵块交换;
    其中在执行相应一个所述矩阵乘任务中,每个所述主计算单元配置成:
    通过所述接口单元获取与其矩阵乘任务关联的一个第一矩阵块和一个第二矩阵块,并且分别存储于第一存储区和第二存储区中;
    对所述一个第一矩阵块和一个第二矩阵块执行矩阵乘操作,以得到一个中间结果;
    通过所述控制单元并且利用所述第一存储区和第二存储区来与相邻的主计算单元执行N-1次矩阵块交换,并且对每次交换到的第一矩阵块和第二矩阵块执行矩阵乘操作,以分别得到N-1个中间结果;以及
    对N个中间结果执行求和操作,以完成与其关联的矩阵乘任务。
  2. 根据权利要求1所述的集成电路装置,其中每个所述主计算单元包括M²个计算子单元,并且所述第一矩阵块和所述第二矩阵块被分别划分成M²个第一矩阵子块和M²个第二矩阵子块,并且一个所述矩阵乘任务包括基于M²个第一矩阵子块和M²个第二矩阵子块的M²个矩阵乘子任务,其中所述M²个计算子单元的每个配置成执行M²个矩阵乘子任务中的对应一个,并且在执行对应一个矩阵乘子任务中,所述计算子单元配置成:
    执行M次如下操作,以获得M个中间子结果:
    从所述第一存储区和所述第二存储区分别获取与其矩阵乘子任务关联的一个第一矩阵子块和一个第二矩阵子块;
    对所述一个第一矩阵子块和对应的一个第二矩阵子块执行矩阵乘操作,以得到一个中间子结果;
    对所述M个中间子结果执行求和操作,以完成与其关联的矩阵乘子任务。
  3. 根据权利要求2所述的集成电路装置,其中所述第一存储区和第二存储区是由所述M²个计算子单元所共享的共享存储区。
  4. 根据权利要求2所述的集成电路装置,其中每个所述主计算单元的多个存储区还包括M²个私有子存储区,并且每个私有子存储区与对应的一个计算子单元关联,并且配置成存储中间子结果。
  5. 根据权利要求2所述的集成电路装置,其中所述N²个主计算单元配置成并行地执行与各自关联的矩阵乘任务,并且所述M²个计算子单元配置成并行地执行与各自关联的矩阵乘子任务。
  6. 根据权利要求1-5的任意一项所述的集成电路装置,其中根据加农算法规则来划分所述第一矩阵和第二矩阵,以得到N²个第一矩阵块和N²个第二矩阵块。
  7. 根据权利要求2-5的任意一项所述的集成电路装置,其中根据加农算法规则来划分所述第一矩阵块和第二矩阵块,以得到M²个第一矩阵子块和M²个第二矩阵子块。
  8. 一种板卡,包括一个或多个根据权利要求1-7的任意一项所述的集成电路装置。
  9. 根据权利要求8所述的板卡,其中当所述板卡包括P²个所述集成电路装置时,所述集成电路装置依次连接以形成数据传递的回路,以便对分别划分成P²*N²*M²个矩阵块的第一矩阵和第二矩阵执行矩阵乘操作,P是大于或等于2的正整数。
  10. 一种计算设备,包括一个或多个根据权利要求8所述的板卡。
  11. 一种计算系统,包括多个根据权利要求10所述的计算设备,其中多个计算设备互联并协同操作,以实现分布式的矩阵乘操作。
  12. 一种使用根据权利要求1-7的任意一项所述的集成电路装置来执行矩阵乘操作的方法,包括:
    使用集成电路装置的接口单元从外部存储器获取用于所述矩阵乘操作的矩阵数据,其中所述矩阵数据包括第一矩阵和第二矩阵,其中第一矩阵和第二矩阵被分别划分成N²个第一矩阵块和N²个第二矩阵块,并且所述第一矩阵和第二矩阵的矩阵乘操作包括基于N²个第一矩阵块和N²个第二矩阵块的N²个矩阵乘任务,其中N是大于或等于2的正整数;以及
    使用每个所述主计算单元来执行以下操作:
    通过所述接口单元获取与其矩阵乘任务关联的一个第一矩阵块和一个第二矩阵块,并且分别存储于第一存储区和第二存储区中;
    对所述一个第一矩阵块和一个第二矩阵块执行矩阵乘操作,以得到一个中间结果;
    通过所述控制单元并且利用所述第一存储区和第二存储区来与相邻的主计算单元执行N-1次矩阵块交换,并且对每次交换到的第一矩阵块和第二矩阵块执行矩阵乘操作,以分别得到N-1个中间结果;以及
    对N个中间结果执行求和操作,以完成与其关联的矩阵乘任务。
  13. The method of claim 12, further comprising using the computing sub-units to perform the following operations:
    performing the following operations M times to obtain M intermediate sub-results:
    acquiring, from the first storage area and the second storage area respectively, one first matrix sub-block and one second matrix sub-block associated with its matrix multiplication sub-task; and
    performing a matrix multiplication operation on the one first matrix sub-block and the corresponding one second matrix sub-block to obtain one intermediate sub-result; and
    performing a summation operation on the M intermediate sub-results to complete the matrix multiplication sub-task associated with it.
  14. The method of claim 13, wherein the first storage area and the second storage area are shared storage areas shared by the N² computing sub-units.
  15. The method of claim 13, wherein the plurality of storage areas of each main computing unit further comprises M² private sub-storage areas, each of which is associated with a corresponding one of the computing sub-units and is configured to store intermediate sub-results.
  16. The method of claim 13, wherein the N² main computing units are used to execute their respectively associated matrix multiplication tasks in parallel, and the M² computing sub-units are used to execute their respectively associated matrix multiplication sub-tasks in parallel.
  17. The method of any one of claims 12-16, comprising partitioning the first matrix and the second matrix according to the rules of Cannon's algorithm to obtain the N² first matrix blocks and the N² second matrix blocks.
  18. The method of any one of claims 13-16, wherein the first matrix block and the second matrix block are partitioned according to the rules of Cannon's algorithm to obtain the M² first matrix sub-blocks and the M² second matrix sub-blocks.
  19. A computer program product comprising program instructions for performing a matrix multiplication operation which, when executed by one or more processors, cause the method according to any one of claims 12-18 to be implemented.
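The Cannon-style block exchange recited in claims 1, 6, 12, and 17 can be sketched as a plain single-process simulation (illustrative only: the function name `cannon_matmul` and the NumPy-based model are assumptions of this sketch, not the claimed hardware). Each of the N² grid positions holds one block of each operand, performs a local block multiply, then passes A-blocks left and B-blocks up around the loop, accumulating N intermediate results per position:

```python
import numpy as np

def cannon_matmul(A, B, n_grid):
    """Simulate Cannon's algorithm on an n_grid x n_grid grid of
    'main computing units'. A and B must be square with dimensions
    divisible by n_grid. Illustrative sketch, not the patented device."""
    n = A.shape[0]
    bs = n // n_grid  # block size held by each grid position
    # Partition each operand into n_grid x n_grid blocks.
    Ab = [[A[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(n_grid)]
          for i in range(n_grid)]
    Bb = [[B[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(n_grid)]
          for i in range(n_grid)]
    # Initial skew: shift row i of A left by i, column j of B up by j.
    Ab = [[Ab[i][(j + i) % n_grid] for j in range(n_grid)] for i in range(n_grid)]
    Bb = [[Bb[(i + j) % n_grid][j] for j in range(n_grid)] for i in range(n_grid)]
    C = [[np.zeros((bs, bs)) for _ in range(n_grid)] for _ in range(n_grid)]
    for _ in range(n_grid):  # one local multiply plus N-1 exchanges
        for i in range(n_grid):
            for j in range(n_grid):
                C[i][j] += Ab[i][j] @ Bb[i][j]
        # Exchange with loop neighbours: A-blocks move left, B-blocks move up.
        Ab = [[Ab[i][(j + 1) % n_grid] for j in range(n_grid)] for i in range(n_grid)]
        Bb = [[Bb[(i + 1) % n_grid][j] for j in range(n_grid)] for i in range(n_grid)]
    return np.block(C)
```

With the initial skew, position (i, j) pairs A-block (i, l) with B-block (l, j) for every l over the N rounds, so summing the N partial products yields block (i, j) of the product, mirroring the summation step of claim 1.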
PCT/CN2021/142653 2020-12-30 2021-12-29 Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method WO2022143799A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/013,635 US20230376562A1 (en) 2020-12-30 2021-12-29 Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011610669.4 2020-12-30
CN202011610669.4A CN114692075A (zh) Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method

Publications (1)

Publication Number Publication Date
WO2022143799A1 true WO2022143799A1 (zh) 2022-07-07

Family

ID=82131660

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/142653 WO2022143799A1 (zh) 2020-12-30 2021-12-29 用于矩阵乘操作的集成电路装置、计算设备、系统和方法

Country Status (3)

Country Link
US (1) US20230376562A1 (zh)
CN (1) CN114692075A (zh)
WO (1) WO2022143799A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8626815B1 (en) * 2008-07-14 2014-01-07 Altera Corporation Configuring a programmable integrated circuit device to perform matrix multiplication
CN107305538A (zh) * 2016-04-22 2017-10-31 北京中科寒武纪科技有限公司 Sub-matrix operation apparatus and method
CN107315574A (zh) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 Apparatus and method for performing matrix multiplication operation
CN109801208A (zh) * 2019-01-24 2019-05-24 西安电子科技大学 SAR image change detection method based on multi-GPU task optimization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI MING, YANG BO-HAN, SHEN XU-BANG: "Parallel Algorithms for Matrix Multiplication on the 2-D Mesh Interconnection Networks", MICROELECTRONICS & COMPUTER, vol. 22, no. 2, 20 March 2005 (2005-03-20), pages 63 - 65+69, XP055948926, ISSN: 1000-7180, DOI: 10.19304/j.cnki.issn1000-7180.2005.02.017 *
MALIHE ALIASGARI; OSVALDO SIMEONE; JOERG KLIEWER: "Private and Secure Distributed Matrix Multiplication with Flexible Communication Load", ARXIV.ORG, 1 September 2019 (2019-09-01), pages 1 - 12, XP081562026 *

Also Published As

Publication number Publication date
US20230376562A1 (en) 2023-11-23
CN114692075A (zh) 2022-07-01

Similar Documents

Publication Publication Date Title
US11841816B2 (en) Network-on-chip data processing method and device
CN112686379B (zh) Integrated circuit apparatus, electronic device, board card, and computing method
TW201935265A (zh) Computing apparatus and method
CN112799726B (zh) Data processing apparatus and method, and related products
WO2023065701A1 (zh) Inner product processing component, arbitrary-precision computing device and method, and readable storage medium
WO2022218373A1 (zh) Method for optimizing convolution operations of a system-on-chip, and related products
US11775808B2 (en) Neural network computation device and method
CN113010845A (zh) Computing apparatus and method for performing matrix multiplication, and related products
WO2022143799A1 (zh) Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method
CN111353124A (zh) Operation method and apparatus, computer device, and storage medium
CN111047005A (zh) Operation method and apparatus, computer device, and storage medium
CN112801276B (zh) Data processing method, processor, and electronic device
WO2022001500A1 (zh) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
CN112817898A (zh) Data transmission method, processor, chip, and electronic device
CN111047030A (zh) Operation method and apparatus, computer device, and storage medium
CN115221101B (zh) Method for optimizing matrix multiplication operations of a system-on-chip, and related products
CN113742266B (zh) Integrated circuit apparatus, electronic device, board card, and computing method
CN113791996B (zh) Integrated circuit apparatus, electronic device, board card, and computing method
WO2022001496A1 (zh) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
CN112232498B (zh) Data processing apparatus, integrated circuit chip, electronic device, board card, and method
CN111353125B (зh) Operation method and apparatus, computer device, and storage medium
CN111384944B (zh) Full adder, half adder, data processing method, chip, and electronic device
CN111290788B (zh) Operation method and apparatus, computer device, and storage medium
WO2022001438A1 (zh) Computing apparatus, integrated circuit chip, board card, device, and computing method
WO2022001454A1 (зh) Integrated computing apparatus, integrated circuit chip, board card, and computing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21914531

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21914531

Country of ref document: EP

Kind code of ref document: A1