WO2021212972A1 - Operation method, processor and related products - Google Patents

Operation method, processor and related products

Info

Publication number
WO2021212972A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
row
column
elements
register
Prior art date
Application number
PCT/CN2021/075957
Other languages
English (en)
French (fr)
Inventor
刘少礼
何得园
刘道福
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202010318387.0A (CN113536221B)
Priority claimed from CN202010317734.8A (CN113536219B)
Priority claimed from CN202010318380.9A (CN113536220A)
Application filed by 中科寒武纪科技股份有限公司
Priority to US17/920,372 (US20230169144A1)
Publication of WO2021212972A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products

Definitions

  • the present disclosure relates to the field of information processing technology, and in particular to an operation method, processor and related products.
  • the neural network algorithm has recently become a very popular machine learning algorithm, and it has achieved very good results in fields such as image recognition, speech recognition, and natural language processing.
  • the complexity of the algorithm is getting higher and higher.
  • the scale of the model is gradually increasing. Processing these large-scale models with GPU and CPU takes a lot of computing time and consumes a lot of power.
  • an arithmetic method for matrix multiplication based on a matrix of processing elements, which is applied to a processor, wherein the processor includes two or more processing elements arranged in a two-dimensional matrix, each processing element includes at least one register, and the method realizes the matrix multiplication operation of a first matrix and a second matrix,
  • the method includes:
  • for each row of the second matrix, the elements of that row and the elements of each column of the first matrix are stored in the registers of the processing elements, the row elements are multiplied respectively by the elements in each column of the first matrix, and the products of each column are summed to obtain the first intermediate result; or, for each column of the second matrix, the elements of that column and the elements of each row of the first matrix are stored in the registers of the processing elements, the column elements are multiplied respectively by the elements in each row of the first matrix, and the products of each row are summed to obtain the first intermediate result;
  • the first intermediate result is processed to obtain the product of the first matrix and the second matrix.
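The row-wise variant of the method above can be sketched in plain Python. This is only an illustrative reading, not the disclosure's implementation: it assumes the first matrix is the right-hand operand resident in a grid of PE registers, and that element i of each row of the second matrix is delivered to the PEs holding row i of the first matrix. The name `row_broadcast_matmul` and the list-of-lists layout are hypothetical.

```python
def row_broadcast_matmul(second, first):
    """Compute second x first, one row of `second` at a time.

    `first` models the matrix resident in the PE registers:
    PE (i, j) is assumed to hold first[i][j].
    """
    rows, cols = len(first), len(first[0])
    result = []
    for row in second:  # one pass per row of the second matrix
        if len(row) != rows:
            raise ValueError("row length must equal the row count of `first`")
        # Each PE (i, j) multiplies its resident element by row[i].
        products = [[row[i] * first[i][j] for j in range(cols)]
                    for i in range(rows)]
        # Summing each PE column yields one row of the intermediate result.
        result.append([sum(products[i][j] for i in range(rows))
                       for j in range(cols)])
    return result


print(row_broadcast_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The column-wise variant is symmetric: a column of the second matrix is multiplied against each row of the resident matrix and the products are summed per row.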
  • a processor including two or more processing elements arranged in a two-dimensional matrix, each processing element including at least one register, the processor being used to perform a matrix multiplication operation on a first matrix and a second matrix,
  • the processor also includes a controller for loading the first matrix into the register of the processing element
  • for each row of the second matrix, the controller is configured to store the elements of that row, together with the elements of each column of the first matrix, in the registers of the corresponding processing elements, to multiply the row elements respectively by the elements in each column of the first matrix, and to sum the products of each column to obtain the first intermediate result; or, for each column of the second matrix, the controller is configured to store the elements of that column, together with the elements of each row of the first matrix, in the registers of the processing elements, to multiply the column elements respectively by the elements in each row of the first matrix, and to sum the products of each row to obtain the first intermediate result;
  • the controller is also used to process the first intermediate result to obtain the product of the first matrix and the second matrix.
  • an artificial intelligence chip including the processor as described above.
  • an electronic device including the artificial intelligence chip as described above.
  • an electronic device including the processor as described above.
  • the calculation methods and processors for matrix multiplication according to the foregoing embodiments of the present disclosure are more suitable for processors composed of processing elements arranged in an array, and have high calculation efficiency. And for an input matrix of any scale that satisfies the arrangement of the processing elements, the operation result of the matrix multiplication can be obtained, the number of memory accesses can be reduced, the bandwidth pressure can be reduced, and the efficiency of the operation can be improved.
  • a processor including two or more processing elements arranged in a two-dimensional matrix, each processing element including at least one register, the processor being used to perform a matrix multiplication operation on a first matrix and a second matrix,
  • the processor further includes a controller configured to load the elements of the transposed matrix of the first matrix and the elements of the second matrix into the registers of the processing elements, respectively, such that elements at corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element;
  • the controller is used to control the transposed matrix or the second matrix to scroll in the row direction or the column direction, to control the processing elements to multiply the elements in their registers to obtain element products, and to sum the element products of the same row or the same column to obtain the first intermediate result;
  • the controller is further configured to process the first intermediate result to obtain the product of the first matrix and the second matrix.
  • an arithmetic method for matrix multiplication based on a matrix of processing elements, which is applied to a processor, wherein the processor includes two or more processing elements arranged in a two-dimensional matrix, each processing element includes at least one register, and the method implements a matrix multiplication operation on a first matrix and a second matrix, the method including:
  • transposing the first matrix to obtain a transposed matrix; loading the elements of the transposed matrix and of the second matrix into the registers of the processing elements, respectively, with elements at corresponding positions of the transposed matrix and the second matrix stored in the register of the same processing element;
  • the first intermediate result is processed to obtain the product of the first matrix and the second matrix.
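One possible reading of the transpose-and-scroll scheme is sketched below. It makes several assumptions that are not stated in the text: both inputs are square n×n matrices, the second matrix is rolled cyclically along the row direction, and the element products are summed per PE column. The function name `matmul_by_rolling` is hypothetical.

```python
def matmul_by_rolling(a, b):
    """Multiply square matrices a x b using a transposed copy of `a`.

    PE (i, j) is assumed to hold at[i][j] (the transpose of `a`) and,
    after s cyclic shifts of each row of `b`, the element b[i][(j + s) % n].
    """
    n = len(a)
    at = [[a[j][i] for j in range(n)] for i in range(n)]  # transpose of a
    c = [[0] * n for _ in range(n)]
    for s in range(n):  # s = number of roll steps applied so far
        for j in range(n):
            # Sum of the element products held in PE column j.
            col_sum = sum(at[i][j] * b[i][(j + s) % n] for i in range(n))
            # That sum equals entry (j, (j + s) % n) of the product, so the
            # intermediate result must be rearranged into its final position.
            c[j][(j + s) % n] = col_sum
    return c


print(matmul_by_rolling([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each roll step produces one wrapped diagonal of the product, which is why a final processing step on the intermediate result is needed to place the sums in their proper positions.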
  • an artificial intelligence chip including the processor as described above.
  • an electronic device including the artificial intelligence chip as described above.
  • the operation result of the matrix multiplication can be obtained and, compared with matrix multiplication in the related art, the number of memory accesses can be reduced, bandwidth pressure can be relieved, and the efficiency of calculation can be improved.
  • an arithmetic method for matrix multiplication based on a matrix of processing elements, which is applied to a processor, wherein the processor includes two or more processing elements arranged in a two-dimensional matrix, each processing element includes at least one register, and the method implements a matrix multiplication operation on a first matrix and a second matrix, the method including:
  • the element product matrix is processed according to the manner of preprocessing the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • the third matrix and the fourth matrix are scrolled in the row direction or the column direction, and the processing element is controlled to multiply the elements in the corresponding registers to obtain the element product matrix, which includes:
  • the control processing element performs multiplication operations on the elements in the corresponding registers to obtain the first element product matrix
  • the elements are multiplied to obtain the second element product matrix, which is repeated p-1 times to obtain the second element product matrix.
  • processing the element product matrix according to the manner of preprocessing the first matrix and the second matrix to obtain the product of the first matrix and the second matrix includes:
  • the fifth matrix is obtained by summing the first element product matrix and the second element product matrix, and the fifth matrix is processed according to the manner of preprocessing the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • a processor includes two or more processing elements, the two or more processing elements are arranged in a two-dimensional matrix, and the processing element includes at least one register.
  • the row rank of, n represents the column rank of the second matrix, the column rank of the first matrix and the row rank of the second matrix are k, and p is the maximum of m, k, and n;
  • the controller is used to scroll the third matrix and the fourth matrix in the row direction or the column direction, and control the processing element to perform multiplication operations on the elements in the corresponding registers to obtain the element product matrix;
  • the controller is used for processing the element product matrix according to the preprocessing method of the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • an arithmetic device based on matrix multiplication of a matrix of processing elements, including the above-mentioned processor.
  • a non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions implement the above method when executed by a processor.
  • an artificial intelligence chip including the processor as described above.
  • an electronic device including the artificial intelligence chip as described above.
  • with the matrix multiplication operation method, processor, and related products of the foregoing embodiments of the present disclosure, data does not need to be read repeatedly during matrix multiplication operations, which reduces the number of memory reads, relieves bandwidth pressure, and gives high computational efficiency. Moreover, an input matrix of any size can be transformed by preprocessing and then operated on to obtain the result of the matrix multiplication.
  • Figure 1-1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
  • Figures 1-2a and 1-2b respectively show examples of different division methods.
  • Figure 1-3 shows a flowchart of an operation method according to an embodiment of the present disclosure.
  • Figure 1-4 shows a schematic diagram of an array composed of processing elements according to an embodiment of the present disclosure.
  • Figure 1-5 shows a schematic diagram of block division according to an embodiment of the present disclosure.
  • Figure 1-6 shows an example of matrix division according to an embodiment of the present disclosure.
  • Figure 2-1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
  • Figures 2-2a and 2-2b respectively show examples of multiple different division methods.
  • Figure 2-3 shows a flowchart of an operation method according to an embodiment of the present disclosure.
  • Figure 2-4 shows a schematic diagram of an array composed of processing elements according to an embodiment of the present disclosure.
  • Figure 2-5 shows a schematic diagram of block division according to an embodiment of the present disclosure.
  • Figure 2-6 shows an example of matrix division according to an embodiment of the present disclosure.
  • Figure 3-1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
  • Figures 3-2a and 3-2b respectively show examples of different ways of dividing the matrix.
  • Figure 3-3 shows a flowchart of an operation method according to an embodiment of the present disclosure.
  • Figure 3-4 shows a schematic diagram of block division according to an embodiment of the present disclosure.
  • Figure 4 shows a structural block diagram of a board according to an embodiment of the present disclosure.
  • the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
  • the phrase “if determined” or “if detected [described condition or event]” can be interpreted as meaning “once determined”, “in response to determination”, “once detected [described condition or event]”, or “in response to detection of [described condition or event]”, depending on the context.
  • Matrix operations account for a relatively large share of the computation when artificial intelligence is used to process information. Existing processors break matrix operations down into multiplication and addition operations, which requires frequently reading data from memory, so the efficiency of calculation is very low.
  • multi-stage pipelines are usually used to implement the operation process.
  • each stage processes part of the input data
  • therefore, data needs to be read from the memory frequently, and frequent access to the memory leads to higher bandwidth requirements.
  • the present disclosure provides an operation method and a processor for executing the operation method.
  • the processor may include multiple processing elements.
  • the multiple processing elements may be arranged in a two-dimensional matrix to better adapt to matrix operations, and each processing element may include at least one register.
  • FIG. 1-1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
  • multiple processing elements (PE, Processing Element)
  • each processing element is connected to adjacent processing elements.
  • Each PE can be provided with at least one register (not shown in the figure).
  • the processor may also include a controller and a memory, where both the controller and the memory are connected to multiple processing elements, and the controller may be connected to the memory. The controller is used for loading data from the memory to the register of the processing element, and controlling the processing element to process the input data.
  • the controller may first load the elements of one matrix into the registers of the corresponding PEs, and then load the elements of the other matrix by rows, by columns, or by element traversal.
  • according to the positions at which the elements of the loaded matrix are held in the registers, the elements are stored in the corresponding registers, and each PE is then controlled to perform operations on the elements stored in the registers of that PE.
  • an executable program may also be stored in the memory, the executable program may include instructions, and the processor executes the instructions to implement matrix multiplication operations.
  • the controller can be provided with a loader, a decoder, etc., where the loader can be used to load the input data in the memory into the registers of the processing elements, and the decoder can decode the data-access instructions in the executable program according to the change in the storage address of the input data after loading.
  • For example, for an instruction that accesses the data, the register address of the data obtained by decoding is assigned to that instruction, the decoded instruction is sent to the processing element, and the processing element executes the instruction to process the data.
  • the memory may be an on-chip cache
  • the controller may load the executable program and the input data (for example, the input matrices, including the left multiplication matrix and the right multiplication matrix) from the off-chip flash memory into the above-mentioned memory (the on-chip cache), after which the subsequent matrix multiplication process is performed.
  • the controller can also directly load the input matrix and the executable program from the off-chip memory to the register of the processing element, which is not limited in the present disclosure.
  • the PE may also include an arithmetic unit to complete the specified operation. Taking matrix operation as an example, the PE may include, for example, a multiplier, an adder, etc.
  • the specific structure of each PE may be the same or different, and the present disclosure does not limit this.
  • the PE may also include other types of arithmetic units to adapt to various different arithmetic processes. The present disclosure does not limit the number and types of arithmetic units included in the PE.
  • the input matrix of the multiplication operation may include a left multiplication matrix and a right multiplication matrix, where the left multiplication matrix may refer to the matrix located on the left side of the multiplication sign, and the right multiplication matrix may refer to the matrix located on the right side of the multiplication sign.
  • the operation method provided by the present disclosure is used to realize the matrix multiplication operation of the first matrix and the second matrix.
  • the first matrix may be the left-multiplying matrix and the second matrix the right-multiplying matrix; or
  • the first matrix may be the right-multiplying matrix and the second matrix the left-multiplying matrix.
  • the controller may determine one of the input matrices as the matrix to be loaded. Since the number and arrangement of PEs in the processor are fixed, the controller may block the matrix to be loaded in some cases, and may not block the matrix loaded into the processor in some cases. For another matrix other than the matrix to be loaded in the input matrix, block processing may not be performed.
  • the controller may determine the matrix to be loaded from the input matrix, and determine whether to block the matrix to be loaded according to the arrangement of processing elements and the number of rows and columns of the matrix to be loaded.
  • the arrangement of processing elements may refer to the number of rows and columns of the processing elements
  • the row rank and column rank of the matrix to be loaded may refer to the number of rows and columns of the matrix.
  • the matrix to be loaded may be a left-multiplying matrix or a right-multiplying matrix, which is not limited in the present disclosure.
  • the controller may leave the matrix to be loaded undivided. If the number of rows of the matrix to be loaded is greater than the number of rows of the processing elements, or the number of columns of the matrix to be loaded is greater than the number of columns of the processing elements, the controller can divide the matrix to be loaded into blocks.
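The blocking condition above reduces to a comparison of two shapes. A minimal sketch, with the hypothetical function name `needs_blocking` and shapes passed as plain integers:

```python
def needs_blocking(rows, cols, pe_rows, pe_cols):
    """Return True when a rows x cols matrix does not fit a pe_rows x pe_cols PE array."""
    return rows > pe_rows or cols > pe_cols


# A 4x4 PE array holds a 2x4 matrix directly, but not a 5x4 or a 2x6 matrix.
for shape in [(2, 4), (5, 4), (2, 6)]:
    print(shape, needs_blocking(*shape, pe_rows=4, pe_cols=4))
```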
  • when determining the matrix to be loaded from the input matrices, the controller may choose randomly, or may, according to the arrangement of processing elements, preferentially choose a matrix that does not need to be divided as the matrix to be loaded.
  • the specific determination method is not limited.
  • the array of processing elements can be expressed as PE_MN, meaning that the processing elements form an M×N matrix, where M is the number of rows of processing elements, N is the number of columns of processing elements, and M and N are both positive integers.
  • the left-multiplying matrix is a_mn, an m×n matrix, where m is the number of rows of a_mn, n is the number of columns of a_mn, and m and n are positive integers; the right-multiplying matrix is b_nk, an n×k matrix, where n is the number of rows of b_nk, k is the number of columns of b_nk, and k is a positive integer. If m is less than M and n is less than N, while n is greater than M or k is greater than N, the controller may select the matrix a_mn as the matrix to be loaded.
  • if both input matrices meet the condition that no blocking is required, either can be used as the matrix to be loaded.
  • the controller can randomly determine one of them as the matrix to be loaded, or it can choose the matrix containing more elements as the matrix to be loaded, which can reduce the number of element loads and improve computing efficiency.
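The selection rule can be illustrated as follows. This sketch implements only the deterministic options described above (prefer a matrix that fits the PE array; if both fit, prefer the one with more elements). The fallback to the left matrix when neither fits is an assumption made here for completeness, and all names are hypothetical.

```python
def fits(shape, pe_shape):
    """True when a (rows, cols) matrix fits the (M, N) PE array without blocking."""
    return shape[0] <= pe_shape[0] and shape[1] <= pe_shape[1]


def choose_matrix_to_load(left_shape, right_shape, pe_shape):
    """Return "left" or "right" to indicate which input matrix to load."""
    left_fits = fits(left_shape, pe_shape)
    right_fits = fits(right_shape, pe_shape)
    if left_fits and right_fits:
        # Both fit: prefer the matrix that contains more elements.
        left_size = left_shape[0] * left_shape[1]
        right_size = right_shape[0] * right_shape[1]
        return "left" if left_size >= right_size else "right"
    if right_fits:
        return "right"
    # Only the left matrix fits, or neither fits (assumed default).
    return "left"


print(choose_matrix_to_load((2, 4), (4, 3), (4, 4)))
```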
  • the controller may divide the matrix to be loaded into blocks according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, obtaining two or more first matrices.
  • loading the first matrix to each processing element is taken as an example, that is, the matrix to be loaded is used as the first matrix or the matrix obtained after the matrix to be loaded is divided into blocks is used as the first matrix.
  • if the loaded first matrix is the left multiplication matrix, the controller can use the right multiplication matrix as the second matrix; if the loaded first matrix is the right multiplication matrix, the controller can use the left multiplication matrix as the second matrix.
  • the controller may process another matrix in the input matrix according to the situation.
  • the controller may or may not divide the other input matrix, that is, the one other than the matrix to be loaded, into blocks.
  • the controller may leave the other matrix undivided; if the matrix to be loaded is a left-multiplied matrix and is divided into blocks in the column direction, the controller may divide the other input matrix into two or more second matrices following the way the matrix to be loaded is divided.
  • the controller can divide the other input matrix into two or more second matrices following the way the matrix to be loaded is divided; if the matrix to be loaded is a right-multiplied matrix and is divided into blocks in the column direction, the controller may leave the other matrix undivided.
  • the matrix to be loaded is a_mn
  • if the matrix to be loaded is b_nk, whether to divide b_nk into blocks is determined from the numbers of rows and columns of b_nk and the numbers of rows and columns of the processing elements: if the number of rows n of b_nk is not greater than the number of rows M of the processing elements and the number of columns k is not greater than the number of columns N of the processing elements, b_nk need not be divided into blocks. If the number of rows n of b_nk is greater than the number of rows M of the processing elements, or the number of columns k is greater than the number of columns N of the processing elements, b_nk can be divided into blocks in the row direction or the column direction.
  • the matrices obtained after block division satisfy the condition that no further blocking is required, that is, the number of rows of each matrix after division is not greater than the number of rows of processing elements, and its number of columns is not greater than the number of columns of processing elements.
  • the controller can divide the matrix a_mn into blocks in the row direction. Because a_mn is the left-multiplying matrix, dividing it in the row direction does not affect normal operation with the right-multiplying matrix, so the controller need not divide the right-multiplying matrix. If the number of rows m of a_mn is not greater than the number of rows M of the processing elements, and the number of columns n is greater than the number of columns N of the processing elements, a_mn can be divided into blocks in the column direction.
  • the right multiplication matrix is then divided in its row direction following the way a_mn is divided in the column direction, so that the column direction of the left multiplication matrix and the row direction of the right multiplication matrix are divided identically; identical division means that each first matrix obtained has as many columns as the corresponding second matrix has rows, which ensures that the matrix operation can be completed normally. If the number of rows m of a_mn is greater than the number of rows M of the processing elements, and the number of columns n is greater than the number of columns N of the processing elements, the controller can divide a_mn in both the row direction and the column direction.
  • in that case, the right multiplication matrix is likewise divided in its row direction following the way a_mn is divided in the column direction, so that the column direction of the left multiplication matrix and the row direction of the right multiplication matrix are divided identically.
  • identical division means that each first matrix obtained has as many columns as the corresponding second matrix has rows, which ensures that the matrix operation can be completed normally.
  • the controller may divide the matrix b_nk into blocks in the column direction. Since b_nk is the right-multiplying matrix, dividing it in the column direction does not affect normal operation with the left-multiplying matrix, so the controller need not divide the left-multiplying matrix. If the number of rows n of b_nk is greater than the number of rows M of the processing elements, and the number of columns k is not greater than the number of columns N of the processing elements, b_nk can be divided into blocks in the row direction.
  • the left multiplying matrix is then divided in its column direction following the way b_nk is divided in the row direction, so that the column direction of the left multiplying matrix and the row direction of the right multiplying matrix are divided identically.
  • identical division means that each first matrix obtained has as many columns as the corresponding second matrix has rows, which ensures that the matrix operation can be completed normally. If the number of rows n of b_nk is greater than the number of rows M of the processing elements, and the number of columns k is greater than the number of columns N of the processing elements, the controller can divide b_nk in both the row and column directions.
  • in that case, the left multiplication matrix can be divided in its column direction following the way b_nk is divided in the row direction, so that the column direction of the left multiplication matrix and the row direction of the right multiplication matrix are divided identically; identical division means that each first matrix obtained has as many columns as the corresponding second matrix has rows, which ensures that the matrix operation can be completed normally.
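The requirement that the column split of the left matrix match the row split of the right matrix can be checked with a short sketch. It covers only the case where the left matrix has too many columns for the PE array; the helper names (`matmul`, `split_sizes`, `blocked_matmul`) and the greedy "as large as possible" split are assumptions for illustration.

```python
def matmul(a, b):
    """Reference product of list-of-lists matrices a (m x n) and b (n x k)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_sizes(total, limit):
    """Cut `total` into chunks of at most `limit`, each as large as possible."""
    return [min(limit, total - start) for start in range(0, total, limit)]


def blocked_matmul(a, b, pe_cols):
    """Split a by columns and b by rows identically, then add the block products."""
    result = [[0] * len(b[0]) for _ in a]
    start = 0
    for size in split_sizes(len(a[0]), pe_cols):
        a_block = [row[start:start + size] for row in a]  # column block of a
        b_block = b[start:start + size]                   # matching row block of b
        partial = matmul(a_block, b_block)
        result = [[r + p for r, p in zip(res_row, par_row)]
                  for res_row, par_row in zip(result, partial)]
        start += size
    return result


a = [[1, 2, 3, 4], [5, 6, 7, 8]]             # a 2x4 left matrix
b = [[1, 0, 2], [0, 1, 3], [4, 0, 1], [2, 2, 0]]  # a 4x3 right matrix
print(blocked_matmul(a, b, pe_cols=2) == matmul(a, b))
```

Because each column block of the left matrix has exactly as many columns as the matching row block of the right matrix has rows, every block product is well defined and the partial products add up to the full result.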
  • the block can be performed in such a way that the row rank and column rank of the block matrix are as close as possible to the number of rows and columns of the processing element, which can improve the efficiency of calculation and shorten the calculation time.
  • the processing elements form a 4×4 array
  • the matrix can first be divided into 4×4 blocks, so that the processing elements can be used with maximum efficiency and the calculation efficiency can be improved.
  • Figure 1-2a is an example of partitioning.
  • Matrix a_24 is divided into two parts in the column direction, each part including two columns, and matrix b_43 is divided into two parts in the row direction, each part including two rows;
  • Figure 1-2b is another example of partitioning: the matrix a_24 is divided into three parts in the column direction, one part including two columns and the other two parts each including one column, and the matrix b_43 is divided into three parts in the row direction, one part including two rows and the other two parts each including one row.
  • the arrangement of the above processing elements and the block method of the input matrix are only an example of the present disclosure, and do not limit the present disclosure in any way.
  • the row rank and column rank of the matrix divided by the block method in Figure 1-2a are closer to the number of rows and columns of processing elements. This can help to improve the utilization of processing elements and reduce control complexity. For the same input matrix, since the number of blocks after block division is small, the number of times to load data is small, and the operation efficiency of this block division method is higher.
  • The present disclosure places no specific restriction on how the left multiplication matrix is divided in the row direction or how the right multiplication matrix is divided in the column direction, as long as the blocks obtained satisfy the condition that no further blocking is required.
  • the first matrix after division can also be stored in the register of the processing element in a stacked storage manner.
  • For example, each processing element can include multiple registers, and the controller can divide the registers in the processing element into multiple different groups. After the controller divides the input matrix into blocks, it can store the two or more first matrices in the multiple groups of registers in a stacked manner, with each group storing one first matrix. In this embodiment, the controller may use the other matrix in the input matrix, that is, the one that is not the matrix to be loaded, as the second matrix. It should be noted that stacked storage is only an optional implementation, and the present disclosure is not limited to this.
  • Figures 1-3 show a flowchart of an operation method according to an embodiment of the present disclosure. The operation method of the present disclosure is first described for the case in which the matrix to be loaded does not need to be divided into blocks. It is assumed that the matrix to be loaded is the first matrix, and that the other matrix in the input matrix is the second matrix. As shown in Figures 1-3, the operation method provided by the present disclosure may include the following steps:
  • Step S1-11: load the first matrix into the registers of the processing elements, where the arrangement of the elements of the first matrix within the matrix is the same as their arrangement in the registers of the processing elements;
  • Step S1-12: for each row or each column of the second matrix, store the elements of that row or column in the registers of the processing elements that hold the corresponding rows or columns of the first matrix, multiply each stored element with the element of the first matrix held in the same processing element, and sum the products to obtain the first intermediate result. In other words, each element of a row or column of the second matrix is stored in the register of every processing element in which an element of the corresponding row or column of the first matrix is stored.
  • Specifically, for each row of the second matrix, the elements of that row are stored in the registers of the processing elements that hold the corresponding rows of the first matrix and are multiplied with the elements of the first matrix, and the products in each column are summed to obtain the first intermediate result; or, for each column of the second matrix, the elements of that column are stored in the registers of the processing elements that hold the corresponding columns of the first matrix and are multiplied with the elements of the first matrix, and the products in each row are summed to obtain the first intermediate result.
  • Step S1-13 processing the first intermediate result to obtain the product of the first matrix and the second matrix.
  • The controller can directly use the left multiplication matrix as the first matrix and the right multiplication matrix as the second matrix, or use the left multiplication matrix as the second matrix and the right multiplication matrix as the first matrix; the present disclosure does not limit this.
  • When the first matrix is the left-multiplying matrix and the second matrix is the right-multiplying matrix, for each column of elements of the second matrix, the controller can store each element of the column in the registers of the processing elements that hold the corresponding column of elements of the first matrix (in other words, each element of the column is stored in the register of every processing element in which an element of the corresponding column of the first matrix is stored), control each processing element to multiply the elements in its registers to obtain the element products, and calculate the sum of the element products of each row to obtain the first intermediate result.
  • Here, the column of elements of the first matrix corresponding to an element is the column whose column number in the first matrix is the same as the row number of that element in the second matrix.
  • When the first matrix is the right-multiplying matrix and the second matrix is the left-multiplying matrix, for each row of elements of the second matrix, the controller can store each element of the row in the registers of the processing elements that hold the corresponding row of elements of the first matrix, control each processing element to multiply the elements in its registers to obtain the element products, and calculate the sum of the element products of each column to obtain the first intermediate result.
  • Here, the row of elements of the first matrix corresponding to an element is the row whose row number in the first matrix is the same as the column number of that element in the second matrix.
  • Depending on which input matrix serves as the first matrix, the processing of the first intermediate result in step S1-13 is different. Specifically, if the first matrix is the left multiplication matrix, each first intermediate result is used as a column of elements of the product matrix of the first matrix and the second matrix, and its column number in the product matrix is the same as the column number, in the second matrix, of the column from which it was obtained; if the first matrix is the right-multiplied matrix, each first intermediate result is used as a row of elements of the product matrix of the first matrix and the second matrix, and its row number in the product matrix is the same as the row number, in the second matrix, of the row from which it was obtained.
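  • Steps S1-11 to S1-13 can be sketched in software as follows. This is only an illustrative simulation under stated assumptions: each processing element is modelled as a dictionary with one register `"a"` for the first matrix and one register `"b"` for the second matrix, and the function name `pe_array_matmul` and the sample matrices are hypothetical.

```python
def pe_array_matmul(first, second, first_is_left=True):
    """Simulate steps S1-11 to S1-13 on an array of processing elements (PEs)."""
    rows, cols = len(first), len(first[0])
    # Step S1-11: PE (i, j) holds first[i][j], mirroring the matrix layout.
    pe = [[{"a": first[i][j], "b": 0} for j in range(cols)] for i in range(rows)]
    if first_is_left:
        product = [[0] * len(second[0]) for _ in range(rows)]
        for k in range(len(second[0])):           # one column of the second matrix
            for i in range(rows):
                for j in range(cols):
                    pe[i][j]["b"] = second[j][k]  # S1-12: element of row j goes to PE column j
            for i in range(rows):                 # sum the element products of each PE row
                product[i][k] = sum(p["a"] * p["b"] for p in pe[i])  # S1-13: column k
        return product
    product = []
    for k in range(len(second)):                  # one row of the second matrix
        for i in range(rows):
            for j in range(cols):
                pe[i][j]["b"] = second[k][i]      # S1-12: element of column i goes to PE row i
        # Sum the element products of each PE column; S1-13: the sums form row k.
        product.append([sum(pe[i][j]["a"] * pe[i][j]["b"] for i in range(rows))
                        for j in range(cols)])
    return product

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
ab = pe_array_matmul(A, B, first_is_left=True)    # first matrix on the left:  A x B
ba = pe_array_matmul(A, B, first_is_left=False)   # first matrix on the right: B x A
```

  • In both modes the first matrix is loaded once and only one row or column of the second matrix is stored per pass, which is the property the method relies on to reduce memory accesses.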
  • When summing the element products of a row or a column, the controller may control the processing elements in that row or column to move the element products calculated each time to one processing element in the row or column, and control that processing element to calculate the sum of the element products to obtain the first intermediate result. For example, when the first matrix is the left-multiplying matrix and the second matrix is the right-multiplying matrix, each time the element products are calculated, the controller can control the processing elements in the same row to move the calculated element products to one processing element in that row, and control that processing element to calculate the sum of the element products to obtain the first intermediate result; when the first matrix is the right-multiplied matrix and the second matrix is the left-multiplied matrix, each time the element products are calculated, the controller can control the processing elements in the same column to move the calculated element products to one processing element in that column, and control that processing element to calculate the sum of the element products to obtain the first intermediate result.
  • the processing element can use an adder to calculate the sum of the product of the elements.
  • The one processing element that calculates the sum may be a processing element that stores elements of the first matrix, or may be a processing element that does not store elements of the first matrix, which is not limited in the present disclosure.
  • a special adder may also be set on the row or column of the processing element array to implement the above calculation process.
  • Example 1-1 The first matrix is the left multiplication matrix, and the second matrix is the right multiplication matrix
  • Assume that both the first matrix a mn and the second matrix b nk are 3×3 matrices, and that the processing elements form a 4×4 array.
  • Figures 1-4 show schematic diagrams of an array composed of processing elements according to an embodiment of the present disclosure. The calculation method of the present disclosure will be described with reference to Figs. 1-4 and Figs. 1-3.
  • The first matrix can be loaded into the registers of the processing elements according to its arrangement of rows and columns, that is, the arrangement of the elements of the first matrix within the matrix is the same as their arrangement in the registers of the processing elements.
  • The same arrangement means that the difference between the row subscript of an element in the matrix and the row subscript of the processing element holding it is the same for all elements, and the difference between the column subscript of an element and the column subscript of the processing element holding it is the same for all elements.
  • For instance, the row number and column number of each element of the first matrix can be the same as the row number and column number, in the array of processing elements, of the processing element loaded with that element.
  • For example, the controller can load A 11 into the register of PE 11 , A 12 into the register of PE 12 , A 13 into the register of PE 13 , A 21 into the register of PE 21 , ..., and A 33 into the register of PE 33 . That is, the subscript of each element of the first matrix is exactly the same as the subscript of the processing element holding it, and the row subscript difference and the column subscript difference mentioned above are both 0.
  • Alternatively, the controller can load A 11 into the register of PE 12 , A 12 into the register of PE 13 , A 13 into the register of PE 14 , A 21 into the register of PE 22 , ..., and A 33 into the register of PE 34 . The arrangement of the elements of the first matrix within the matrix is still the same as their arrangement in the registers of the processing elements; the row subscript difference is 0 and the column subscript difference is 1.
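  • The two loading arrangements above can be illustrated with a short Python sketch. The function `load_first_matrix`, its parameters, and the use of `None` for unused processing elements are assumptions made for illustration; list indices are 0-based, so PE 11 corresponds to `grid[0][0]`.

```python
def load_first_matrix(first, pe_rows, pe_cols, row_offset=0, col_offset=0):
    """Place first[i][j] in PE (i + row_offset, j + col_offset) of a pe_rows x pe_cols array.

    Returns a grid whose entry [r][c] is the element held by that PE, or None if unused.
    """
    if len(first) + row_offset > pe_rows or len(first[0]) + col_offset > pe_cols:
        raise ValueError("first matrix does not fit in the PE array at this offset")
    grid = [[None] * pe_cols for _ in range(pe_rows)]
    for i, row in enumerate(first):
        for j, value in enumerate(row):
            # The same offsets apply to every element, so the arrangement is preserved.
            grid[i + row_offset][j + col_offset] = value
    return grid

a = [["A11", "A12", "A13"],
     ["A21", "A22", "A23"],
     ["A31", "A32", "A33"]]
aligned = load_first_matrix(a, 4, 4)                # subscript differences (0, 0)
shifted = load_first_matrix(a, 4, 4, col_offset=1)  # subscript differences (0, 1)
```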
  • The controller may store the element B 11 in the first column of the second matrix in the registers of the processing elements that hold the corresponding column of elements of the first matrix. The corresponding column of elements is the column whose column number in the first matrix is the same as the row number of the element in the second matrix.
  • For example, B 11 is in the first row of the second matrix, so the corresponding column of elements is the first column of the first matrix. That is, the controller stores the element B 11 in the registers of the processing elements in which A 11 , A 21 , and A 31 are stored.
  • Similarly, the controller stores the element B 21 in the first column of the second matrix in the registers of the processing elements in which A 12 , A 22 , and A 32 are stored, and stores the element B 31 in the first column of the second matrix in the registers of the processing elements in which A 13 , A 23 , and A 33 are stored.
  • B 11 and A 11 are stored in the register of the same processing element
  • B 11 and A 21 are stored in the register of the same processing element
  • B 11 and A 31 are stored in the register of the same processing element.
  • B 21 and A 12 are stored in the register of the same processing element
  • B 21 and A 22 are stored in the register of the same processing element
  • B 21 and A 32 are stored in the register of the same processing element.
  • B 31 and A 13 are stored in the register of the same processing element
  • B 31 and A 23 are stored in the register of the same processing element
  • B 31 and A 33 are stored in the register of the same processing element.
  • The controller in the processor controls the processing elements to calculate the products of the elements stored in the corresponding registers, and then calculates the sum of the products of each row to obtain the first intermediate results: B 11 ×A 11 +B 21 ×A 12 +B 31 ×A 13 , B 11 ×A 21 +B 21 ×A 22 +B 31 ×A 23 , B 11 ×A 31 +B 21 ×A 32 +B 31 ×A 33 .
  • the above-mentioned first intermediate result can be expressed as: C 11 , C 21 , C 31 .
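  • Written in matrix form, these three sums are the first column of the product of the first matrix and the second matrix. The following is only a restatement of the sums above in the same symbols:

```latex
\begin{pmatrix} C_{11} \\ C_{21} \\ C_{31} \end{pmatrix}
=
\begin{pmatrix}
A_{11} & A_{12} & A_{13} \\
A_{21} & A_{22} & A_{23} \\
A_{31} & A_{32} & A_{33}
\end{pmatrix}
\begin{pmatrix} B_{11} \\ B_{21} \\ B_{31} \end{pmatrix},
\qquad
C_{i1} = \sum_{j=1}^{3} A_{ij} B_{j1}, \quad i = 1, 2, 3.
```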
  • Taking as an example the loading in which A 11 is loaded into the register of PE 11 , A 12 into the register of PE 12 , A 13 into the register of PE 13 , A 21 into the register of PE 21 , and so on (the subscript of each element of the first matrix is exactly the same as the subscript of the processing element holding it, so the row subscript difference and the column subscript difference are both 0), the controller controls each processing element to use a multiplier to multiply the elements in its registers to obtain the element products.
  • The controller can then control each row of processing elements to move the calculated element products to one processing element in that row. For example, the controller can control PE 11 , PE 12 and PE 13 to move the calculated element products B 11 ×A 11 , B 21 ×A 12 and B 31 ×A 13 to the processing element PE 14 , and control PE 14 to use an adder to sum these element products to obtain C 11 . It should be noted that the controller may instead control the processing elements of the first row to move the element products to PE 11 , PE 12 or PE 13 , which is not limited in the present disclosure. After the controller controls the processing elements in the second row and the third row to perform similar operations, the first intermediate results C 11 , C 21 , C 31 are obtained.
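  • The row reduction just described can be sketched as follows. The function `reduce_row`, the 0-based `target` index, and the numeric values are hypothetical; the count of moves is included only to illustrate that the choice of the summing processing element changes how many products have to be moved, which is an observation of this sketch rather than a statement of the disclosure.

```python
def reduce_row(products, target):
    """Move the element products of one PE row to the PE at index `target` and add them there.

    `products` holds one entry per PE in the row (None for a PE that computed nothing).
    Returns the sum and the number of products moved between processing elements.
    """
    accumulator = products[target] if products[target] is not None else 0
    moves = 0
    for index, value in enumerate(products):
        if index == target or value is None:
            continue  # the target keeps its own product; idle PEs contribute nothing
        accumulator += value  # the adder in the target PE accumulates the moved product
        moves += 1
    return accumulator, moves

# Illustrative values: A11*B11, A12*B21, A13*B31 in PE11..PE13, and PE14 idle.
row1 = [1 * 5, 2 * 7, 3 * 9, None]
c11_via_pe14 = reduce_row(row1, target=3)  # sum in PE14, which holds no element of the first matrix
c11_via_pe11 = reduce_row(row1, target=0)  # sum in PE11, which already holds one product
```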
  • After the above operations have been performed for each column of the second matrix, the product of the first matrix and the second matrix can be obtained by storing the first intermediate results in columns. That is, as described above, when the first matrix is the left multiplication matrix, the first intermediate result obtained each time is used as a column of elements of the product matrix of the first matrix and the second matrix, and its column number in the product matrix is the same as the column number, in the second matrix, of the column from which it was obtained.
  • For example, the first intermediate results C 11 , C 21 , and C 31 , obtained by operating on the first column of elements of the second matrix and the elements of the first matrix, form the first column of the product matrix c 33 .
  • Example 1-2 The first matrix is the right multiplication matrix, and the second matrix is the left multiplication matrix
  • Assume that both the first matrix a mn and the second matrix b nk are 3×3 matrices, and that the processing elements form a 4×4 array.
  • The first matrix is loaded into the registers of the processing elements; for the loading method, refer to the method of loading the first matrix in Example 1-1, which is not repeated here.
  • The element B 11 in the first row of the second matrix is stored in the registers of the processing elements that hold the corresponding row of elements of the first matrix. The corresponding row of elements is the row whose row number in the first matrix is the same as the column number of the element in the second matrix.
  • For example, B 11 is in the first column of the second matrix, so the corresponding row of elements is the first row of the first matrix. That is, the controller can store the element B 11 in the registers of the processing elements in which A 11 , A 12 , and A 13 are stored.
  • B 11 and A 11 are stored in the register of the same processing element
  • B 11 and A 12 are stored in the register of the same processing element
  • B 11 and A 13 are stored in the register of the same processing element.
  • B 12 and A 21 are stored in the register of the same processing element
  • B 12 and A 22 are stored in the register of the same processing element
  • B 12 and A 23 are stored in the register of the same processing element.
  • B 13 and A 31 are stored in the register of the same processing element
  • B 13 and A 32 are stored in the register of the same processing element
  • B 13 and A 33 are stored in the register of the same processing element.
  • The controller in the processor controls the processing elements to calculate the products of the elements stored in the corresponding registers, and then calculates the sum of the products of each column to obtain the first intermediate results: B 11 ×A 11 +B 12 ×A 21 +B 13 ×A 31 , B 11 ×A 12 +B 12 ×A 22 +B 13 ×A 32 , B 11 ×A 13 +B 12 ×A 23 +B 13 ×A 33 .
  • the above-mentioned first intermediate result can be expressed as: C 11 , C 12 , C 13 .
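  • Written in matrix form, these three sums are the first row of the product of the second matrix (on the left) and the first matrix (on the right). The following is only a restatement of the sums above in the same symbols:

```latex
\begin{pmatrix} C_{11} & C_{12} & C_{13} \end{pmatrix}
=
\begin{pmatrix} B_{11} & B_{12} & B_{13} \end{pmatrix}
\begin{pmatrix}
A_{11} & A_{12} & A_{13} \\
A_{21} & A_{22} & A_{23} \\
A_{31} & A_{32} & A_{33}
\end{pmatrix},
\qquad
C_{1j} = \sum_{i=1}^{3} B_{1i} A_{ij}, \quad j = 1, 2, 3.
```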
  • Taking as an example the loading in which A 11 is loaded into the register of PE 11 , A 12 into the register of PE 12 , A 13 into the register of PE 13 , A 21 into the register of PE 21 , and so on (the subscript of each element of the first matrix is exactly the same as the subscript of the processing element holding it, so the row subscript difference and the column subscript difference are both 0), the controller controls each processing element to use a multiplier to multiply the elements in its registers to obtain the element products.
  • The controller can then control each column of processing elements to move the calculated element products to one processing element in that column. For example, the controller can control PE 11 , PE 21 and PE 31 to move the calculated element products B 11 ×A 11 , B 12 ×A 21 and B 13 ×A 31 to the processing element PE 41 , and control PE 41 to use an adder to sum these element products to obtain C 11 . The controller can also control the processing elements of the first column to move the element products to PE 11 , PE 21 or PE 31 , which is not limited in the present disclosure. After the controller controls the processing elements in the second column and the third column to perform similar operations, the first intermediate results C 11 , C 12 , C 13 are obtained.
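  • After the above operations have been performed for each row of the second matrix, the product of the first matrix and the second matrix can be obtained by storing the first intermediate results in rows, since each first intermediate result is a row of elements of the product matrix when the first matrix is the right-multiplied matrix.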
  • the calculation result of the matrix multiplication can be obtained for an input matrix of any scale that satisfies the arrangement of the processing elements.
  • the result of matrix multiplication can be directly obtained according to the above example.
  • The matrix multiplication operation method is well suited to a processor composed of processing elements arranged in an array. Compared with the matrix multiplication operation in the related art, it can reduce the number of memory accesses, relieve bandwidth pressure, and improve operational efficiency.
  • In the case of block division, for the first matrices and second matrices obtained after blocking (a second matrix can be obtained by blocking, or the other input matrix can be used directly as the second matrix), a second intermediate result is obtained from each first matrix and its corresponding second matrix, and the product of the left multiplication matrix and the right multiplication matrix, that is, the product of the input matrix, is calculated from the second intermediate results according to the matrix multiplication rule.
  • Each first matrix and second matrix obtained after blocking can be treated as one element of a matrix, and the second intermediate results are obtained by carrying out the operation process of matrix multiplication on these elements according to the rules of matrix multiplication.
  • Figures 1-5 show schematic diagrams of block division according to an embodiment of the present disclosure.
  • the matrices D and E are divided into blocks in the manner described above to obtain the first matrix D 11 , D 12 , D 21 , D 22 , and the second matrix E 11 , E 12 , E 21 , E 22 .
  • the first matrix and the second matrix can be used as an element of the matrix to perform the operation process of matrix multiplication.
  • the specific process of calculating the second intermediate result can be obtained by performing operations on the corresponding first matrix and second matrix respectively according to the process of step S1-11 to step S1-13.
  • By dividing the input matrix into blocks and performing the matrix multiplication operation of the present disclosure on each pair of blocks, the second intermediate results are obtained, and the product of the input matrix can be calculated from the second intermediate results using the rule of matrix multiplication.
  • the process of matrix multiplication can be quickly realized for any dimension of the matrix, and the calculation efficiency is high.
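  • Assuming, for illustration, that D is the left multiplication matrix and E is the right multiplication matrix (the figure itself is not reproduced in the text), treating each block as one element of a matrix gives the following rule, in which every term D_{ik}E_{kj} is one second intermediate result:

```latex
\begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix}
\begin{pmatrix} E_{11} & E_{12} \\ E_{21} & E_{22} \end{pmatrix}
=
\begin{pmatrix}
D_{11}E_{11} + D_{12}E_{21} & D_{11}E_{12} + D_{12}E_{22} \\
D_{21}E_{11} + D_{22}E_{21} & D_{21}E_{12} + D_{22}E_{22}
\end{pmatrix}
```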
  • Each processing element may include multiple registers, and the controller may divide the registers in the processing element into multiple sets of registers. The processor then includes multiple sets of registers, and each set of registers is used to store one first matrix obtained by partitioning. Therefore, in a possible implementation manner, the controller may group the registers of the processing element according to the manner of dividing the input matrix into blocks to obtain multiple sets of registers.
  • the calculation method of the present disclosure may further include:
  • the controller stacks and stores the two or more first matrices in the multiple sets of registers, and each set of registers stores one first matrix.
  • the controller may also store the first matrix one at a time, referring to the example in FIGS. 1-5, and calculate the product of the input matrix according to the second intermediate result.
  • the second matrix corresponding to the first matrix may refer to a matrix that needs to be multiplied with the first matrix among the matrixes obtained by dividing the left-multiply matrix/right-multiply matrix according to the matrix multiplication rule.
  • The following takes a 2×2 array of processing elements and 4×4 input matrices as an example to illustrate the operation method of the present disclosure.
  • Both the left multiplication matrix and the right multiplication matrix can be divided into 2×2 matrices. It should be noted that the above block method is only an example of the present disclosure, and other methods may also be used to perform blocking, which is not limited in the present disclosure.
  • Figures 1-6 show examples of matrix division according to an embodiment of the present disclosure. As shown in Figure 1-6, both the left multiplication matrix and the right multiplication matrix can be divided into 2×2 sub-matrices. After the left multiplication matrix is divided, four first matrices a 11 , a 12 , a 21 , a 22 are obtained; after the right multiplication matrix is divided, four second matrices b 11 , b 12 , b 21 , b 22 are obtained (the entries of each sub-matrix are as shown in Figure 1-6).
  • Taking the calculation of the second intermediate result using the process of step S1-11 to step S1-13 as an example, and assuming that the processing elements form a 2×2 array as in the example shown in Figures 1-6, the first matrices can be loaded first; the result of loading is shown in Table 1-1.
  • Here Reg0, Reg1, Reg2, and Reg3 each denote one group of registers in the processing elements. The processing elements form a 2×2 array, each processing element includes multiple registers, and the registers in the same group are used together for data storage.
  • The first matrices and the corresponding second matrices are processed in the manner of step S1-12: Reg0 stores a 11 , and the first column of b 11 is stored in the registers of the processing elements where the first row and second row of a 11 are located; Reg1 stores a 12 , and the first column of b 21 is stored in the registers of the processing elements where the first row and second row of a 12 are located; Reg2 stores a 21 , and the first column of b 12 is stored in the registers of the processing elements where the first row and second row of a 21 are located; Reg3 stores a 22 , and the first column of b 22 is stored in the registers of the processing elements where the first row and second row of a 22 are located. The resulting contents of the registers of the processing elements are shown in Table 1-2.
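  • A minimal sketch of stacked storage on a 2×2 array follows. The class `ProcessingElement`, the functions `stack_load`, `broadcast_column` and `row_sums`, and all numeric values are assumptions for illustration; register group `g` plays the role of Reg0 to Reg3, and the sample blocks stand in for a 11 , a 12 , a 21 , a 22 , whose actual entries are not given in the text.

```python
class ProcessingElement:
    """One PE with `groups` register groups; each group has a slot for each input matrix."""
    def __init__(self, groups):
        self.first = [None] * groups   # element of a first matrix, one per group
        self.second = [None] * groups  # element of the matching second matrix

def stack_load(blocks, pe_rows, pe_cols):
    """Store block g in register group g of the PE array (stacked storage)."""
    array = [[ProcessingElement(len(blocks)) for _ in range(pe_cols)]
             for _ in range(pe_rows)]
    for g, block in enumerate(blocks):
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                array[i][j].first[g] = value
    return array

def broadcast_column(array, group, column):
    """Step S1-12 for one group: column[j] goes to every PE in array column j."""
    for pe_row in array:
        for j, pe in enumerate(pe_row):
            pe.second[group] = column[j]

def row_sums(array, group):
    """Multiply within each PE and add along each PE row (first intermediate result)."""
    return [sum(pe.first[group] * pe.second[group] for pe in pe_row) for pe_row in array]

a11, a12 = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
a21, a22 = [[9, 10], [11, 12]], [[13, 14], [15, 16]]
array = stack_load([a11, a12, a21, a22], 2, 2)  # Reg0..Reg3 hold a11, a12, a21, a22
broadcast_column(array, 0, [1, 0])              # assumed first column of b11 into Reg0
broadcast_column(array, 1, [1, 1])              # assumed first column of b21 into Reg1
reg0_result = row_sums(array, 0)
reg1_result = row_sums(array, 1)
```

  • Because every group keeps its own first matrix resident, switching between blocks only requires storing a new column of the second matrix, not reloading the first matrices.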
  • The controller in the processor controls the processing elements to calculate the products of the elements stored in the corresponding registers to obtain the element products, and then calculates the sum of the element products of each row to obtain the first intermediate results (the specific process is as described in the above examples and is not repeated here).
  • Processing the first intermediate results yields the second intermediate results a 11 ×b 11 , a 12 ×b 21 , a 21 ×b 12 and a 22 ×b 22 . From these, the controller can control the processing elements to calculate blocks of the product according to the matrix multiplication rule, for example C 22 = a 21 ×b 12 + a 22 ×b 22 (and, by the same rule, C 11 = a 11 ×b 11 + a 12 ×b 21 ).
  • The controller can also control the processing elements to calculate the second intermediate results a 11 ×b 12 , a 12 ×b 22 , a 21 ×b 11 and a 22 ×b 21 according to the process of step S1-11 to step S1-13: store the first column of b 11 in the registers of the processing elements where the first row and second row of a 21 are located, store the first column of b 21 in the registers of the processing elements where the first row and second row of a 22 are located, store the first column of b 12 in the registers of the processing elements where the first row and second row of a 11 are located, and store the first column of b 22 in the registers of the processing elements where the first row and second row of a 12 are located. The controller in the processor then controls the processing elements to calculate the products of the elements stored in the corresponding registers to obtain the element products, and calculates the sum of the element products of each row to obtain the first intermediate results. The second columns of b 11 , b 12 , b 21 , and b 22 are stored, multiplied and summed by rows in a similar way to obtain further first intermediate results, and processing the first intermediate results yields the second intermediate results a 11 ×b 12 , a 12 ×b 22 , a 21 ×b 11 and a 22 ×b 21 .
  • Alternatively, the controller may first store the first column of b 11 both in the registers of the processing elements where the first row and second row of a 11 are located and in the registers of the processing elements where the first row and second row of a 21 are located, and store the first column of b 21 both in the registers of the processing elements where the first row and second row of a 12 are located and in the registers of the processing elements where the first row and second row of a 22 are located.
  • the controller in the processor controls the processing element to calculate the product of the elements stored in the corresponding register to obtain the element product, and then calculate the sum of the element product of each row to obtain the first intermediate result.
  • The controller may control the processing elements to calculate the second intermediate results a 11 ×b 11 , a 12 ×b 21 , a 21 ×b 11 and a 22 ×b 21 according to the first intermediate results.
  • The above process can also be repeated to obtain the second intermediate results a 11 ×b 12 , a 12 ×b 22 , a 21 ×b 12 and a 22 ×b 22 .
  • the specific process will not be repeated.
  • From the second intermediate results, the product of the input matrix can be calculated.
  • the product of the input matrix can be calculated in a block-wise manner. Therefore, the matrix multiplication operation method according to the present disclosure can realize matrix operations of any size. Moreover, compared with the matrix multiplication operation in the related technology, the number of memory accesses can be reduced, the bandwidth pressure can be reduced, and the efficiency of the operation can be improved.
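  • The block-wise calculation for the 4×4 example can be sketched end to end as follows. The sketch assumes square input matrices whose size is a multiple of the block size; the names `matmul`, `block` and `blocked_matmul` and the sample values are illustrative, and the per-block product is computed with an ordinary reference multiplication rather than a simulated processing element array.

```python
def matmul(a, b):
    """Reference product of two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def block(m, i, j, size):
    """Return the size x size sub-matrix at block row i, block column j."""
    return [row[j * size:(j + 1) * size] for row in m[i * size:(i + 1) * size]]

def blocked_matmul(left, right, size):
    """Multiply square matrices block by block and accumulate the block products."""
    n = len(left) // size
    result = [[0] * len(right[0]) for _ in left]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # One second intermediate result: first matrix a_ik times second matrix b_kj.
                partial = matmul(block(left, i, k, size), block(right, k, j, size))
                for r in range(size):
                    for c in range(size):
                        result[i * size + r][j * size + c] += partial[r][c]
    return result

left = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
right = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 1, 1, 0]]
product = blocked_matmul(left, right, 2)  # 2x2 blocks to match a 2x2 PE array
```

  • With 2×2 blocks, each of the four blocks of the product is the sum of two second intermediate results, eight in total, matching the two sets of four listed above.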
  • Although the steps in the flowchart are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless specifically stated herein, the execution of these steps is not strictly limited in order, and these steps can be executed in other orders. Moreover, at least part of the steps in the flowchart may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
  • Figure 1-1 shows an example of a processor.
  • The processor may include more than two processing elements, which are arranged in a two-dimensional matrix, and each processing element includes at least one register. The processor is used to perform matrix multiplication of the first matrix and the second matrix.
  • the processor further includes a controller, and the controller is configured to load the first matrix into the register of the processing element;
  • For each row of the second matrix, the controller is configured to store the elements of that row in the registers of the processing elements that hold the corresponding rows of the first matrix, multiply them with the elements of the first matrix, and calculate the sum of the products of each column to obtain the first intermediate result; or, for each column of the second matrix, the controller is configured to store the elements of that column in the registers of the processing elements that hold the corresponding columns of the first matrix, multiply them with the elements of the first matrix, and calculate the sum of the products of each row to obtain the first intermediate result;
  • the controller is also used to process the first intermediate result to obtain the product of the first matrix and the second matrix.
  • the first matrix may be one of a plurality of first matrices obtained after the matrix to be loaded is divided into blocks, and the matrix to be loaded may be a left-multiplied matrix or a right-multiplied matrix.
  • the other matrix in the input matrix except the matrix to be loaded is the second matrix.
  • the first matrix may not be a partitioned matrix.
  • the first matrix may be a left-multiplying matrix or a right-multiplying matrix in the input matrix
  • the second matrix may be another matrix in the input matrix.
  • The controller of the processor of the present disclosure can also determine, according to the arrangement of the processing elements, that the matrix in the input matrix that does not need to be partitioned is the first matrix and that the other matrix is the second matrix, where the input matrix includes a left multiplication matrix and a right multiplication matrix.
  • When the first matrix is the left-multiplying matrix and the second matrix is the right-multiplying matrix, for each column of elements of the second matrix, the controller is configured to store each element of the column in the registers of the processing elements that hold the corresponding column of elements of the first matrix, control each processing element to multiply the elements in its registers to obtain the element products, and calculate the sum of the element products of each row to obtain the first intermediate result, where the column of elements of the first matrix corresponding to an element is the column whose column number is the same as the row number of that element in the second matrix.
  • the first matrix is a right-multiplying matrix and the second matrix is a left-multiplying matrix.
  • for each row of elements in the second matrix, the controller is configured to store each element of the row in the registers of the processing elements that store the corresponding row of elements of the first matrix, control each processing element to multiply the elements in its registers to obtain element products, and sum the element products of each column to obtain the first intermediate result, where the row of elements in the first matrix corresponding to an element is the row whose row number equals the column number of that element in the second matrix.
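  • As an illustrative sketch only, the case in which the first matrix is the left-multiplying matrix can be modelled in plain Python. The function name `matmul_by_column_broadcast` and the list-of-lists layout are assumptions made for illustration; they stand in for the registers of the processing elements and do not describe the actual hardware.

```python
def matmul_by_column_broadcast(a, b):
    """Multiply a (m x n, left matrix resident in the PE registers) by b (n x k).

    Hypothetical software model: a[i][j] plays the role of the register of the
    processing element at row i, column j.
    """
    m, n, k = len(a), len(b), len(b[0])
    if any(len(row) != n for row in a):
        raise ValueError("columns of a must equal rows of b")
    product = [[0] * k for _ in range(m)]
    for c in range(k):
        column = [b[j][c] for j in range(n)]
        # The element in row j of b goes to every PE that holds column j of a.
        element_products = [[a[i][j] * column[j] for j in range(n)]
                            for i in range(m)]
        # Summing each row gives the first intermediate result,
        # which is column c of the product.
        intermediate = [sum(row) for row in element_products]
        for i in range(m):
            product[i][c] = intermediate[i]
    return product


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = matmul_by_column_broadcast(a, b)
```

  • In this sketch, processing the first intermediate results simply means placing each one as a column of the product; the right-multiplying case is symmetric, with rows and columns exchanged.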
  • the controller is also used to determine the matrix to be loaded from the input matrix, where the input matrix includes a left-multiplying matrix and a right-multiplying matrix, and the matrix to be loaded is the left-multiplying matrix or the right-multiplying matrix; to determine, according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, whether to block the matrix to be loaded; and, if the matrix to be loaded is to be divided into blocks, to divide it according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded to obtain two or more first matrices.
  • the controller is further configured to block another matrix in the input matrix except the matrix to be loaded to obtain two or more second matrices according to the manner in which the matrix to be loaded is divided into blocks;
  • the processor includes multiple sets of registers. After the input matrix is divided into blocks, the controller is further configured to stack and store the two or more first matrices in the multiple sets of registers, each set storing one first matrix.
  • the controller may also calculate the product of the left multiplication matrix and the right multiplication matrix according to the product of the first matrix and the corresponding second matrix according to the rule of matrix multiplication.
  • the embodiment of the present disclosure also provides an artificial intelligence chip, which includes the processor as described above.
  • a board card, which includes a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip; wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to realize data transmission between the artificial intelligence chip and an external device; and the control device is used to monitor the state of the artificial intelligence chip.
  • the method includes:
  • the first intermediate result is processed to obtain the product of the first matrix and the second matrix.
  • the first matrix is a left-multiplying matrix and the second matrix is a right-multiplying matrix
  • for each column of elements in the second matrix, store each element of the column in the registers of the processing elements that store the corresponding column of elements of the first matrix, control each processing element to multiply the elements in its registers to obtain element products, and sum the element products of each row to obtain the first intermediate result,
  • where the column of elements in the first matrix corresponding to an element is the column whose column number equals the row number of that element in the second matrix.
  • the first matrix is a right-multiplied matrix and the second matrix is a left-multiplied matrix
  • for each row of elements in the second matrix, store each element of the row in the registers of the processing elements that store the corresponding row of elements of the first matrix, control each processing element to multiply the elements in its registers to obtain element products, and sum the element products of each column to obtain the first intermediate result,
  • where the row of elements in the first matrix corresponding to an element is the row whose row number equals the column number of that element in the second matrix.
  • according to the arrangement of the processing elements, it is determined from the input matrix that the matrix that does not need to be partitioned is the first matrix, and the other matrix in the input matrix is the second matrix.
  • the input matrix includes a left-multiplying matrix and a right-multiplying matrix
  • the to-be-loaded matrix is a left-multiplying matrix or a right-multiplying matrix
  • if the matrix to be loaded is to be divided into blocks, the matrix to be loaded is divided into blocks according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded to obtain two or more first matrices.
  • Clause A6 The method according to Clause A5, the method further comprising:
  • the product of the left multiplication matrix and the right multiplication matrix is calculated according to the rule of matrix multiplication.
  • Clause A7 The method of clause A5, wherein the processor includes multiple sets of registers, and the method further includes:
  • the two or more first matrices are stacked and stored in the multiple sets of registers, and each group stores one first matrix.
  • a processor including two or more processing elements, where the two or more processing elements are arranged in a two-dimensional matrix, each processing element includes at least one register, and the processor is configured to perform a matrix multiplication operation on a first matrix and a second matrix,
  • the processor also includes a controller for loading the first matrix into the register of the processing element
  • for each row of the second matrix, the controller is configured to store the elements of that row in the registers of the processing elements that store each column of the first matrix, multiply them with the elements of each column of the first matrix, and sum the products of each column to obtain the first intermediate result; or, for each column of the second matrix, the controller is configured to store the elements of that column in the registers of the processing elements that store each row of the first matrix, multiply them with the elements of each row of the first matrix, and sum the products of each row to obtain the first intermediate result;
  • the controller is also used to process the first intermediate result to obtain the product of the first matrix and the second matrix.
  • Clause A9 The processor according to clause A8, wherein the first matrix is a left-multiplying matrix and the second matrix is a right-multiplying matrix,
  • for each column of elements in the second matrix, the controller is used to store each element of the column in the registers of the processing elements that store the corresponding column of elements of the first matrix, control each processing element to multiply the elements in its registers to obtain element products, and sum the element products of each row to obtain the first intermediate result,
  • where the column of elements in the first matrix corresponding to an element is the column whose column number equals the row number of that element in the second matrix.
  • Clause A10 The processor according to clause A8, wherein the first matrix is a right-multiplying matrix and the second matrix is a left-multiplying matrix,
  • for each row of elements in the second matrix, the controller is used to store each element of the row in the registers of the processing elements that store the corresponding row of elements of the first matrix, control each processing element to multiply the elements in its registers to obtain element products, and sum the element products of each column to obtain the first intermediate result,
  • where the row of elements in the first matrix corresponding to an element is the row whose row number equals the column number of that element in the second matrix.
  • Clause A11 The processor according to any one of clauses A8-A10, wherein the processor is further configured to determine from the input matrix, according to the arrangement of the processing elements, that the matrix that does not need to be partitioned is the first matrix and that the other matrix of the input matrix is the second matrix, and the input matrix includes a left multiplication matrix and a right multiplication matrix.
  • Clause A12 The processor according to any one of clauses A8-A10, wherein the controller is further configured to determine a matrix to be loaded from an input matrix, wherein the input matrix includes a left multiplication matrix and a right multiplication matrix, and the matrix to be loaded is the left multiplication matrix or the right multiplication matrix; and to determine whether to block the matrix to be loaded according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded;
  • if the matrix to be loaded is to be divided into blocks, the controller is configured to block the matrix to be loaded according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded to obtain two or more first matrices.
  • Clause A14 The processor of clause A12, wherein the processor includes multiple sets of registers, and after the input matrix is divided into blocks, the controller is further configured to stack and store the two or more first matrices in the multiple sets of registers, each set storing one first matrix.
  • Clause A16 An electronic device including the artificial intelligence chip as described in Clause A15.
  • the present disclosure provides an operation method and a processor for executing the operation method.
  • the processor may include multiple processing elements (more than two), these processing elements may be arranged in a two-dimensional matrix, and each processing element may include at least one register.
  • FIG 2-1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
  • multiple processing elements (PE, Processing Element)
  • each processing element is connected to adjacent processing elements.
  • Each PE can be provided with at least one register (not shown in the figure).
  • the processor may also include a controller and a memory, where both the controller and the memory are connected to multiple processing elements, and the controller may be connected to the memory.
  • the controller is used to load input data from the memory to the register of the processing element, and control the processing element to process the input data.
  • the memory may store a first matrix and a second matrix, and the processor is used to perform a matrix multiplication operation on the first matrix and the second matrix. Therefore, the controller can load the first matrix and the second matrix into the registers of the processing elements and control the processing elements to perform the matrix multiplication operation.
  • an executable program may also be stored in the memory; the executable program may include instructions, and executing the instructions may implement the matrix multiplication operation on the first matrix and the second matrix.
  • the controller can be provided with a loader, a decoder, etc., where the loader can be used to load the input data in the memory into the registers of the processing elements, and the decoder can decode the data-access instructions in the executable program according to the storage addresses of the input data after loading. For example, for a data-access instruction, the address of the register in which the data is stored, obtained by decoding, is assigned to the instruction, and the decoded instruction is sent to the processing element.
  • the processing element executes instructions to implement data processing, for example, implement matrix multiplication operations on the first matrix and the second matrix.
  • the memory may be an on-chip cache
  • the controller may load the executable program and the input data (for example, the input matrix, including the left multiplication matrix and the right multiplication matrix) from the off-chip flash memory into the above-mentioned memory (the on-chip cache), after which the subsequent matrix multiplication process is performed.
  • the controller can also directly load the input matrix and the executable program from the off-chip memory to the register of the processing element, which is not limited in the present disclosure.
  • the PE may also include an arithmetic unit to complete the specified operation. Taking matrix operation as an example, the PE may include, for example, a multiplier, an adder, etc.
  • the specific structure of each PE may be the same or different, which is not limited in the present disclosure.
  • the PE may also include other types of arithmetic units to adapt to various different arithmetic processes. The present disclosure does not limit the number and types of arithmetic units included in the PE.
  • the input matrix of the matrix multiplication operation may include a left multiplication matrix and a right multiplication matrix, where the left multiplication matrix may refer to the matrix located on the left side of the multiplication sign, and the right multiplication matrix may refer to the matrix located on the right side of the multiplication sign.
  • the controller can determine whether to block the input matrix according to the arrangement of the processing elements and the row rank and column rank of the input matrix.
  • the arrangement of the processing elements can refer to the number of rows and columns of the processing elements
  • the row rank and column rank of the input matrix can refer to the number of rows and columns of the left multiplying matrix and the right multiplying matrix.
  • the controller determining whether to block the input matrix according to the arrangement of the processing elements and the row rank and column rank of the input matrix may refer to: judging whether the number of rows or columns of the input matrix is greater than the number of rows or columns of the processing elements, and determining whether to block the input matrix according to the result of the judgment.
  • the input matrix may not be divided into blocks.
  • the controller can block the input matrix.
  • the array of processing elements can be represented as PE_MN, which means that the processing elements form an M × N matrix, where M represents the number of rows of the matrix and N represents the number of columns of the matrix.
  • one input matrix is A_mn, which represents an m × n matrix, where m represents the number of rows and n the number of columns; the other input matrix is B_nk, which represents an n × k matrix, where n represents the number of rows and k the number of columns.
  • the input matrix may not be divided into blocks.
  • if the number of rows n of the transposed matrix of A_mn is not greater than the number of rows M of the processing elements and its number of columns m is not greater than the number of columns N of the processing elements, and the number of rows n of B_nk is not greater than the number of rows M of the processing elements and its number of columns k is not greater than the number of columns N of the processing elements, the input matrix may not be divided into blocks.
  • the input matrix can be divided into blocks; or, if the number of rows n of the transposed matrix of A_mn is greater than the number of rows M of the processing elements, or its number of columns m is greater than the number of columns N of the processing elements, or the number of rows n of B_nk is greater than the number of rows M of the processing elements, or its number of columns k is greater than the number of columns N of the processing elements, the input matrix can be divided into blocks.
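  • The size check described above can be sketched as follows, for the variant in which the left-multiplying matrix is the one that is transposed. The function name `needs_blocking` and its parameter names are hypothetical and chosen only for this illustration.

```python
def needs_blocking(m, n, k, pe_rows, pe_cols):
    """Return True if A (m x n) or B (n x k) does not fit an M x N PE array.

    Assumes the left matrix A is transposed before loading, so that the
    matrices placed on the array are A^T (n x m) and B (n x k).
    """
    transposed_a_fits = n <= pe_rows and m <= pe_cols
    b_fits = n <= pe_rows and k <= pe_cols
    return not (transposed_a_fits and b_fits)


fits = needs_blocking(2, 4, 3, pe_rows=4, pe_cols=4)       # False
too_tall = needs_blocking(2, 4, 3, pe_rows=2, pe_cols=2)   # True
```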
  • the controller can split the rows of the left multiplication matrix or the columns of the right multiplication matrix according to the arrangement of the processing elements.
  • the controller can divide the column direction of the left multiplication matrix and the row direction of the right multiplication matrix into blocks in the same way according to the arrangement of the processing elements and the row rank and column rank of the input matrix.
  • the left multiplication matrix and the transposed right multiplication matrix can be divided in the same manner in the column direction, or the transposed left multiplication matrix and the right multiplication matrix can be divided in the same manner in the row direction, where division in the same manner means that the numbers of columns or rows of the first matrix and the second matrix obtained after the division are the same, so as to ensure that the matrix operation can be completed normally.
  • the column direction of the left multiplication matrix and the row direction of the right multiplication matrix are divided in the same way.
  • the divided first matrix and second matrix need to meet the condition that no further blocking is required, that is, the numbers of rows of the transposed first matrix and of the second matrix are not greater than the number of rows of the processing elements and their numbers of columns are not greater than the number of columns of the processing elements, or the numbers of rows of the first matrix and of the transposed second matrix are not greater than the number of rows of the processing elements and their numbers of columns are not greater than the number of columns of the processing elements.
  • the controller can divide the first matrix or the second matrix in such a way that the row rank and column rank of the divided first matrix or second matrix are as close as possible to the number of rows and columns of the processing elements, which can improve the efficiency of the operation and shorten the calculation time. That is to say, assuming that the processing elements form a 4 × 4 array, the matrix can first be divided so that the divided blocks are 4 × 4, so that the processing elements are used with maximum efficiency and the calculation efficiency is improved.
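  • One simple way to choose block sizes that stay as close as possible to the size of the processing element array is to cut full-sized blocks first and leave the remainder as a final smaller block. The sketch below assumes this greedy policy; the present disclosure does not prescribe it, and the name `split_sizes` is hypothetical.

```python
def split_sizes(length, limit):
    """Split one matrix dimension into block sizes that do not exceed limit.

    Full-sized blocks come first; any remainder forms the last block.
    """
    if length < 0 or limit <= 0:
        raise ValueError("length must be non-negative and limit positive")
    full_blocks, remainder = divmod(length, limit)
    return [limit] * full_blocks + ([remainder] if remainder else [])


sizes = split_sizes(10, 4)   # a dimension of 10 on a 4 x 4 array
```

  • With a 4 × 4 array, a dimension of 10 is cut into blocks of 4, 4 and 2 under this policy.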
  • Figure 2-2a and Figure 2-2b respectively show a variety of different ways of dividing.
  • the matrix A_24 is divided into blocks in the column direction and the matrix B_43 is divided into blocks in the row direction in the same manner.
  • Figure 2-2a is one example of division: matrix A_24 is divided into two parts in the column direction, each part including two columns, and matrix B_43 is divided into two parts in the row direction, each part including two rows;
  • Figure 2-2b is another example of division: matrix A_24 is divided into three parts in the column direction, one part including two columns and the other two parts each including one column, and matrix B_43 is divided into three parts in the row direction, one part including two rows and the other two parts each including one row.
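  • The two divisions of Figure 2-2a and Figure 2-2b can be checked with a short sketch: the columns of the 2 × 4 matrix and the rows of the 4 × 3 matrix are cut at the same positions, each pair of blocks is multiplied, and the partial products are added. The matrix values and the helper names `matmul` and `blocked_matmul` are made up for illustration.

```python
def matmul(a, b):
    """Reference product of two list-of-lists matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def blocked_matmul(a, b, sizes):
    """Split a by columns and b by rows using the same sizes, then add the
    products of the corresponding blocks."""
    if sum(sizes) != len(b) or len(a[0]) != len(b):
        raise ValueError("block sizes must cover the shared dimension")
    total = [[0] * len(b[0]) for _ in a]
    start = 0
    for size in sizes:
        a_block = [row[start:start + size] for row in a]  # columns of a
        b_block = b[start:start + size]                   # rows of b
        partial = matmul(a_block, b_block)
        total = [[x + y for x, y in zip(r1, r2)]
                 for r1, r2 in zip(total, partial)]
        start += size
    return total


a24 = [[1, 2, 3, 4], [5, 6, 7, 8]]
b43 = [[1, 0, 2], [0, 1, 3], [4, 0, 1], [2, 2, 0]]
two_parts = blocked_matmul(a24, b43, [2, 2])       # as in Figure 2-2a
three_parts = blocked_matmul(a24, b43, [2, 1, 1])  # as in Figure 2-2b
```

  • Both divisions give the same result as the unblocked product, which is why the blocks only need to agree in the shared dimension.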
  • the above arrangement of processing elements and the division of the input matrix are merely an example of the present disclosure, and do not limit the present disclosure in any way.
  • the present disclosure does not make specific limitations on the division of the row direction of the left multiplication matrix and the column direction of the right multiplication matrix, as long as the divided matrices meet the condition that no further blocking is required.
  • in matrix multiplication, the elements in the rows of the left multiplication matrix and the elements in the columns of the right multiplication matrix are multiplied one by one and then summed. Therefore, in a possible implementation manner, for the non-blocking case, or for a first matrix and the corresponding second matrix after blocking, the controller is used to load each element of the transposed matrix of the first matrix and each element of the second matrix into the registers of the processing elements, with the elements at corresponding positions of the transposed matrix and the second matrix stored in the registers of the same processing element.
  • the elements at the corresponding positions of the transposed matrix and the second matrix may refer to the elements in the transposed matrix and the second matrix that need to be multiplied.
  • the controller can first transpose the first matrix to obtain the transposed matrix and then load the elements of the transposed matrix into the registers of the processing elements; or, in another possible implementation, the controller can implement the transposition of the first matrix during the loading process. For example, if the first matrix is a right-multiplied matrix, then while loading the elements of the first matrix into the registers of the processing elements, the controller can load a column of elements of the first matrix into the registers of a row of processing elements, thereby realizing the transposition of the first matrix.
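  • Transposition during loading can be pictured as writing each column of the first matrix into one row of processing element registers. The sketch below models the register array as a list of lists and marks unused registers with `None`; the name `load_transposed` and this representation are assumptions for illustration.

```python
def load_transposed(matrix, pe_rows, pe_cols):
    """Load matrix into a pe_rows x pe_cols register grid, transposing on the fly.

    Column c of the matrix is written into row c of the grid, so no separate
    transposition pass is needed.
    """
    rows, cols = len(matrix), len(matrix[0])
    if cols > pe_rows or rows > pe_cols:
        raise ValueError("transposed matrix does not fit the PE array")
    registers = [[None] * pe_cols for _ in range(pe_rows)]
    for c in range(cols):
        for r in range(rows):
            registers[c][r] = matrix[r][c]
    return registers


grid = load_transposed([[1, 2, 3], [4, 5, 6]], pe_rows=4, pe_cols=4)
```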
  • the transposed matrix and the second matrix are aligned in the row or column direction. Specifically, if the left multiplication matrix is transposed, then after loading, the rows of the transposed matrix are aligned with the rows of the second matrix in the column direction; if the right multiplication matrix is transposed, then after loading, the columns of the transposed matrix are aligned with the columns of the second matrix in the row direction.
  • After loading the transposed matrix and the second matrix, the controller is also used to control the elements in the transposed matrix or the second matrix to scroll in the row direction or the column direction, control the processing elements to multiply the elements in the corresponding registers to obtain element products, and sum the element products in the same row or column to obtain the first intermediate result.
  • the controller controls the processing elements and the transposed matrix and second matrix stored in the registers to repeat the following process until the elements in the transposed matrix or the second matrix return to their unscrolled positions: the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products and sums the element products in the same row or column to obtain a first intermediate result, and then controls the transposed matrix or the second matrix stored in the registers to scroll in the row direction or the column direction.
  • that is, the controller first controls the processing elements to multiply the elements in the corresponding registers to obtain element products and sums the element products of the same row or the same column to obtain a first intermediate result, then controls the elements of the transposed matrix or the second matrix to scroll by one row or one column in the row direction or column direction, and judges whether the elements of the transposed matrix or the second matrix are at their initial positions after the scrolling, where the initial position refers to the position of an element of the transposed matrix or the second matrix before any scrolling. If they are, the process ends.
  • if they are not, the controller again controls the processing elements to multiply the elements in the corresponding registers, sums the element products in the same row or the same column to obtain a further first intermediate result, scrolls the elements of the transposed matrix or the second matrix by one row or one column, and judges again; the above process is repeated until the elements of the transposed matrix or the second matrix are back at their initial positions.
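  • The multiply, sum and scroll loop can be sketched for one concrete configuration: the first matrix is the left-multiplying matrix A (m × n), its transpose is loaded alongside the right-multiplying matrix B (n × k), the transposed matrix is scrolled to the left in the row direction, and the element products of each column are summed. The sketch assumes m = k so that both matrices occupy the same processing elements; the function names are hypothetical.

```python
def scroll_left(grid):
    """Scroll every row one position to the left with wrap-around."""
    return [row[1:] + row[:1] for row in grid]


def first_intermediate_results(a, b):
    """Return one first intermediate result (a list of column sums) per scroll step."""
    m, n, k = len(a), len(b), len(b[0])
    if m != k or any(len(row) != n for row in a):
        raise ValueError("this sketch assumes a is m x n and b is n x m")
    transposed = [[a[i][j] for i in range(m)] for j in range(n)]  # n x m
    results = []
    # After m single-column scrolls the elements are back at their initial
    # positions, so the loop performs exactly m multiply-and-sum rounds.
    for _ in range(m):
        products = [[transposed[i][j] * b[i][j] for j in range(k)]
                    for i in range(n)]
        results.append([sum(products[i][j] for i in range(n))
                        for j in range(k)])
        transposed = scroll_left(transposed)
    return results


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
intermediate = first_intermediate_results(a, b)
```

  • For these inputs the first round yields the diagonal of the product (19 and 50) and the second round yields the remaining two entries (43 and 22); the entries still have to be rearranged to form the product matrix.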
  • the first matrix is a left-multiplying matrix
  • the second matrix is a right-multiplying matrix.
  • the first matrix is a right-multiplying matrix
  • the second matrix is a left-multiplying matrix
  • the controller controls the elements in the transposed matrix to scroll in the row direction, or controls the elements of the second matrix to scroll in the row direction, and controls the processing elements to multiply the elements in the corresponding registers to obtain element products and to sum the element products of the same column to obtain the first intermediate result.
  • the controller controls the elements in the transposed matrix to scroll in the column direction, or controls the elements in the second matrix to scroll in the column direction, and controls the processing elements to multiply the elements in the corresponding registers to obtain element products and to sum the element products in the same row to obtain the first intermediate result.
  • the aforementioned scrolling scrolls one row or one column at a time.
  • a closed loop is formed between the processing elements storing the elements of the matrix. Since adjacent processing elements are connected together, the controller can determine how to form the loop according to the dimensions of the matrix. For example, to scroll by row (scrolling in the column direction), the first row and the last row of processing elements that store the elements of the matrix are connected, so that elements can scroll between the position of the first row and the position where the last row of elements is stored. To scroll by column (scrolling in the row direction), the first column and the last column of processing elements that store the elements of the matrix are connected.
  • the connection between processing elements here may refer to a virtual connection, that is, there is no actual connection line; instead, the controller records the corresponding processing elements and forms a closed loop during the scrolling process.
  • the controller may process the first intermediate result to obtain the product of the first matrix and the second matrix.
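  • The closed loop amounts to a wrap-around (cyclic) shift over only those processing elements that hold matrix elements. A minimal sketch, in which the step counts, the sign convention (positive means right or down) and the function names are illustrative assumptions:

```python
def scroll_row_direction(grid, steps=1):
    """Scroll by column: every row shifts right by steps, wrapping around."""
    width = len(grid[0])
    s = steps % width
    return [row[-s:] + row[:-s] if s else list(row) for row in grid]


def scroll_column_direction(grid, steps=1):
    """Scroll by row: whole rows move down by steps, wrapping around."""
    s = steps % len(grid)
    moved = grid[-s:] + grid[:-s] if s else grid
    return [list(row) for row in moved]


right_once = scroll_row_direction([[1, 2, 3], [4, 5, 6]])
down_once = scroll_column_direction([[1, 2], [3, 4], [5, 6]])
```

  • Because the shift is cyclic, scrolling a matrix with w columns by w single steps in the row direction restores the original arrangement.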
  • the controller stores the first intermediate result in rows or columns, and after scrolling in the row direction or the column direction, the product of the first matrix and the second matrix is obtained.
  • the specific processing method is related to the matrix to be transposed and the direction of scrolling, for example:
  • the first intermediate result can be stored in columns, and the elements of the first intermediate result scrolled to the right in the row direction; for example, the i-th row of elements is scrolled to the right in the row direction by i-1 steps;
  • the first matrix is a right-multiplying matrix and the second matrix is a left-multiplying matrix
  • the first intermediate result can be stored in columns, and the elements of the first intermediate result scrolled to the left in the row direction; for example, the i-th row of elements is scrolled to the left in the row direction by i-1 steps;
  • the first matrix is a left-multiplying matrix and the second matrix is a right-multiplying matrix
  • when the transposed matrix is scrolled to the left in the row direction, the first intermediate result can be stored in rows, and the i-th column of elements scrolled down in the column direction by i-1 steps to obtain the product of the input matrix;
  • the first matrix is a left-multiplied matrix and the second matrix is a right-multiplied matrix
  • when the transposed matrix is scrolled to the right in the row direction, the first intermediate result can be stored in rows, and the i-th column of elements scrolled up in the column direction by i-1 steps to obtain the product of the input matrix.
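  • For the case in which the transposed left-multiplying matrix is scrolled to the left and the first intermediate results are stored in rows, the final rearrangement can be sketched as follows. The sketch uses 0-based column indices, so the column with index j is scrolled down by j steps, which matches the i-th column being scrolled by i-1 steps; the function name `realign_columns_down` is hypothetical.

```python
def realign_columns_down(stored):
    """Scroll the column with index j down by j steps (cyclically).

    stored[s] is the first intermediate result produced after s left scrolls.
    """
    height, width = len(stored), len(stored[0])
    product = [[None] * width for _ in range(height)]
    for j in range(width):
        for s in range(height):
            product[(s + j) % height][j] = stored[s][j]
    return product


# Intermediate results for A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]
product = realign_columns_down([[19, 50], [43, 22]])
```

  • The rearranged rows equal the product of the two example matrices, [[19, 22], [43, 50]].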
  • the processor provided by the present disclosure can block the input matrix and then stack and store it, and at the same time perform matrix multiplication on the corresponding matrix after the block, which can reduce the memory access frequency and improve the operation efficiency.
  • the controller is also used to calculate the product of the left-multiplying matrix and the right-multiplying matrix from the products of the first matrices and the corresponding second matrices obtained by blocking. That is to say, for each first matrix and the corresponding second matrix after blocking, the product of the first matrix and the second matrix is calculated, and then the product of the left-multiplying matrix and the right-multiplying matrix is calculated from these products. This can reduce the frequency of memory access and improve computing efficiency.
  • the processor includes multiple sets of registers.
  • the controller can divide the registers of the processing elements into multiple groups according to the block of the matrix.
  • after dividing the input matrix into blocks, the controller can transpose the two or more first matrices to obtain transposed matrices, and load the transposed matrices and the two or more second matrices into the multiple sets of registers for stacked storage, with each set of registers storing a transposed matrix and the second matrix at the corresponding position.
  • Before each scroll of the elements of the transposed matrix or the second matrix in the row direction or the column direction, the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products and sums the element products to obtain the first intermediate result; after controlling the elements in a set of registers to scroll the transposed matrix by one row or one column in the row or column direction, the controller also corrects the rolling result.
  • correcting the rolling result includes:
  • the correction method is to scroll the last column of data in each transposed matrix after scrolling to the last column of the adjacent previous transposed matrix data;
  • the correction method is to scroll the first column of data in each block of transposed matrix after scrolling to the first column of the next adjacent block of transposed matrix data;
  • the correction method is to scroll the last row of data in each transposed matrix after scrolling to the last row of the adjacent previous transposed matrix data
  • the correction method is to scroll the first row of data in each block of transposed matrix after scrolling to the first row of the next adjacent block of transposed matrix data;
  • each block of the transposed matrix refers to the matrix after each block of the matrix is transposed.
  • the specific calculation and correction process will be described in detail in the example below.
  • the present disclosure also provides an operation method for realizing matrix multiplication operation.
  • Fig. 2-3 shows a flowchart of an operation method according to an embodiment of the present disclosure.
  • The left-multiplication matrix can be used directly as the first matrix and the right-multiplication matrix as the second matrix, or the left-multiplication matrix can be used as the second matrix and the right-multiplication matrix as the first matrix; the present disclosure does not limit this.
  • the calculation method provided by the present disclosure may include the following steps:
  • Step S2-11: transpose the first matrix to obtain a transposed matrix, and load the transposed matrix and the second matrix into the registers of the processing elements, so that the elements at the corresponding positions of the transposed matrix and the second matrix are stored in the registers of the same processing element.
  • the elements at the corresponding positions of the transposed matrix and the second matrix may refer to the elements in the transposed matrix and the second matrix that need to be multiplied.
  • After loading, the transposed matrix and the second matrix are aligned in the row or column direction. Specifically, if the left-multiplication matrix is transposed, then after loading the rows of the transposed matrix are aligned with the rows of the second matrix in the column direction; if the right-multiplication matrix is transposed, then after loading the columns of the transposed matrix are aligned with the columns of the second matrix in the row direction.
  • Step S2-12: control the transposed matrix or the second matrix to scroll in the row direction or the column direction, control the processing elements to multiply the elements in the corresponding registers to obtain element products, and sum the element products in the same row or the same column to obtain the first intermediate result.
  • Step S2-12 may specifically include repeating the following process until the elements of the transposed matrix or the second matrix return to their unscrolled positions: control the processing elements to multiply the elements in the corresponding registers to obtain element products, and sum the element products in the same row or the same column to obtain the first intermediate result; then, within the array of processing elements, scroll the transposed matrix or the second matrix by one row or one column in the row direction or the column direction.
  • Step S2-13: process the first intermediate result to obtain the product of the first matrix and the second matrix.
  • In other words, the processing elements are first controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or column are summed to obtain the first intermediate result. The elements of the transposed matrix or the second matrix are then scrolled by one row or one column in the row direction or the column direction. At this point it can be judged whether, after scrolling, the elements of the transposed matrix or the second matrix are back at their initial positions, where the initial position refers to the position of the elements before any scrolling. If they are, the process ends and step S2-13 follows.
  • If they are not, the processing elements are again controlled to multiply the elements in the corresponding registers and sum the element products in the same row or the same column to obtain the first intermediate result, the elements of the transposed matrix or the second matrix are again scrolled by one row or column, and the positions are judged again; this repeats until the elements of the transposed matrix or the second matrix are back at their initial positions.
  • In one possible implementation, the first matrix is a left-multiplication matrix and the second matrix is a right-multiplication matrix; in another, the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix.
  • When the left-multiplication matrix is transposed, step S2-12 may include: controlling the elements of the transposed matrix, or the elements of the second matrix, to scroll in the row direction, controlling the processing elements to multiply the elements in the corresponding registers to obtain element products, and summing the element products in the same column to obtain the first intermediate result.
  • When the right-multiplication matrix is transposed, step S2-12 may include: controlling the elements of the transposed matrix, or the elements of the second matrix, to scroll in the column direction, controlling the processing elements to multiply the elements in the corresponding registers to obtain element products, and summing the element products in the same row to obtain the first intermediate result.
  • Each scroll moves the matrix by one row or one column.
  • processing the first intermediate result may refer to: storing the first intermediate result in rows or columns, and scrolling in the row direction or column direction to obtain the product of the first matrix and the second matrix.
  • the specific processing method is related to the matrix to be transposed and the direction of scrolling, for example:
  • when the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix, the first intermediate result can be stored in columns and its elements scrolled in the row direction, either to the right (for example, the i-th row is scrolled to the right by i-1 steps) or to the left (for example, the i-th row is scrolled to the left by i-1 steps), depending on the direction in which the transposed matrix was scrolled;
  • when the first matrix is a left-multiplication matrix, the second matrix is a right-multiplication matrix, and the transposed matrix is scrolled to the left in the row direction, the first intermediate result can be stored in rows, and the i-th column is scrolled down in the column direction by i-1 steps to obtain the product of the input matrices;
  • when the first matrix is a left-multiplication matrix, the second matrix is a right-multiplication matrix, and the transposed matrix is scrolled to the right in the row direction, the first intermediate result can be stored in rows, and the i-th column is scrolled up in the column direction by i-1 steps to obtain the product of the input matrices.
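  • The shift by i-1 steps can be checked with a short index calculation. The following is a sketch for one case only, assuming the right-multiplication matrix B is transposed and scrolled upward on an n×n array, with A the left-multiplication matrix and C = AB. The symbol k (the number of scrolls already performed) and the notation S are introduced here for illustration, and indices are taken cyclically in 1..n:

```latex
\begin{aligned}
S^{(k)}_{i} &= \sum_{j=1}^{n} A_{ij}\,\bigl(B^{T}\bigr)_{i+k,\,j}
             = \sum_{j=1}^{n} A_{ij}\,B_{j,\,i+k}
             = C_{i,\,i+k}, \qquad k = 0, 1, \dots, n-1, \\
\text{column of } S^{(k)}_{i} \text{ after the shift} &= (k+1) + (i-1) = i+k .
\end{aligned}
```

  • Under these assumptions the k-th computation fills column k+1 of the first intermediate result, so scrolling row i to the right by i-1 steps places C_{i,i+k} in column i+k, which is its position in the product.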
  • The process of steps S2-11 to S2-13 is explained below using two examples: one in which the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix, and one in which the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix.
  • Example 2-1: the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix; that is, the right-multiplication matrix is transposed.
  • The processing elements form a 4×4 array.
  • Fig. 2-4 shows a schematic diagram of an array composed of processing elements according to an embodiment of the present disclosure. The operation method of the present disclosure is described with reference to Fig. 2-3 and Fig. 2-4.
  • The second matrix can be loaded into the registers of the processing elements according to its own row and column arrangement; that is, the arrangement of the elements in the second matrix is the same as their arrangement in the registers of the processing elements.
  • In one possible implementation, the row and column numbers of an element in the second matrix are the same as the row and column numbers, within the array of processing elements, of the processing element loaded with that element.
  • For example, A11 can be loaded into the register of PE11, A12 into PE12, A13 into PE13, A21 into PE21, ..., and A33 into PE33; that is, the subscript of each element of the second matrix is exactly the same as the subscript of the processing element that holds it.
  • Alternatively, A11 can be loaded into the register of PE12, A12 into PE13, A13 into PE14, A21 into PE22, ..., and A33 into PE34; here the subscripts differ, but the arrangement of the elements in the second matrix is still the same as their arrangement in the registers of the processing elements.
  • The transposed matrix can be loaded into the registers of the processing elements in the same manner; in other words, after loading, the columns of the second matrix are aligned with the columns of the transposed matrix, and the elements at the corresponding positions of the transposed matrix and the second matrix are stored in the registers of the same processing element.
  • For example, suppose A11 is loaded into the register of PE11, ..., and A33 into the register of PE33, so that the subscript of each element is exactly the same as the subscript of the processing element that holds it. Then B11 can be loaded into the register of PE11, B21 into PE12, B31 into PE13, B12 into PE21, B22 into PE22, B32 into PE23, ..., and B33 into PE33. That is, the transposed matrix is loaded into the registers of the processing elements in an order aligned with the columns of the second matrix.
  • The transposed matrix can be loaded first and then the second matrix, or the two can be loaded at the same time.
  • The present disclosure does not limit the specific loading method, as long as, after loading, the transposed matrix and the second matrix are aligned in the row direction and the elements at the corresponding positions of the transposed matrix and the second matrix are stored in the registers of the same processing element.
  • In the column direction, the processing elements storing the first row of the transposed matrix can be connected with the processing elements storing its last row to form rings; data can flow around each ring, which realizes the scrolling of the matrix in the column direction.
  • For example, PE11 and PE31 can be connected to form a ring, PE12 and PE32 can be connected to form a ring, and PE13 and PE33 can be connected to form a ring.
  • The controller can control the processing elements to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row are summed to obtain the first intermediate result.
  • For example, the controller can control PE11 to multiply the elements A11 and B11 stored in its registers to obtain the element product A11×B11; similarly, it can control PE12 and PE13 to obtain A12×B21 and A13×B31. Summing the element products of the first row gives C11 = A11×B11 + A12×B21 + A13×B31, and C22 and C33 are obtained from the second and third rows in the same way.
  • C11, C22 and C33 may be temporarily stored in the buffer as the first column of the first intermediate result.
  • the buffer can be located outside of multiple processing elements in the processor.
  • The transposed matrix can then be scrolled up by one row, with the elements of the first row wrapping around to the last row of the processing elements that store the matrix.
  • The transposed matrix can also be scrolled down by one row.
  • The present disclosure does not limit the specific scrolling direction; for the example in this embodiment, it is sufficient to scroll in the column direction in units of rows.
  • redundant registers in the processing element or on-chip cache in the processor can be used to implement the rolling process of the data in the matrix. This embodiment is applicable to the rolling process in Example 2-1 and Example 2-2 of the present disclosure.
  • For example, the elements of the first row of the transposed matrix can be temporarily stored in the redundant registers; the processing elements of the second row are then controlled to send the second-row elements of the transposed matrix stored in their registers to the processing elements of the first row, and the processing elements of the third row are controlled to send the third-row elements of the transposed matrix stored in their registers to the processing elements of the second row.
  • Finally, the temporarily stored first-row elements are stored in the registers of the processing elements of the third row, which completes the scrolling of the transposed matrix by one row.
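  • The row-by-row hand-off through a redundant register can be sketched as follows. This is only an illustration under simplifying assumptions: a Python list of rows stands in for the registers of the processing elements, one spare variable stands in for the redundant registers, and the name `scroll_up_one_row` is hypothetical.

```python
def scroll_up_one_row(rows):
    """Scroll a list of PE rows up by one row, using one spare buffer.

    rows[r] models the elements held by the processing elements of row r.
    The list is updated in place and also returned.
    """
    spare = rows[0]                # park the first row in the redundant register
    for r in range(1, len(rows)):
        rows[r - 1] = rows[r]      # row r hands its elements to row r-1
    rows[-1] = spare               # the parked first row lands in the last row
    return rows


if __name__ == "__main__":
    bt = [["B11", "B21", "B31"],
          ["B12", "B22", "B32"],
          ["B13", "B23", "B33"]]
    for row in scroll_up_one_row(bt):
        print(row)
```

  • Scrolling as many times as there are rows returns every element to its initial position, which is the termination condition described for step S2-12.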
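  • After the remaining computations, with the transposed matrix scrolled up by one row each time, the first intermediate result stored in the buffer is $\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{22} & C_{23} & C_{21} \\ C_{33} & C_{31} & C_{32} \end{pmatrix}$.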
  • In step S2-13, processing the first intermediate result means that the controller stores the obtained first intermediate result in columns and then scrolls the i-th row of the first intermediate result to the right in the row direction by i-1 steps to obtain the product of the input matrices.
  • The scrolling here likewise takes place in a closed loop along the row direction: the first column and the last column of the processing elements storing the matrix are connected to form a closed loop, so that when scrolling to the right, the elements stored in the last column of processing elements wrap around to the first column of processing elements.
  • Depending on the direction in which the transposed matrix was scrolled, processing the first intermediate result may instead mean that the controller stores the obtained first intermediate result in columns and then scrolls the i-th row of the first intermediate result to the left in the row direction by i-1 steps to obtain the product of the input matrices.
  • The controller may also scroll the elements of the first intermediate result in the row direction (for example, to the right or to the left) according to the row and column identifiers of the first intermediate result to obtain the product of the input matrices.
  • The elements stored in the registers can all carry the row and column identifiers of the element within its matrix.
  • These identifiers are used to determine the row and column identifiers of the elements of the first intermediate result, so that the controller can scroll the elements of the first intermediate result in the row direction according to them to obtain the product of the first matrix and the second matrix.
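  • One way to picture identifier-driven scrolling is sketched below. The representation is an assumption made for illustration: each element of the first intermediate result is paired with a 0-based (row, column) identifier giving its position in the product, and the function name `scroll_rows_by_identifier` is hypothetical.

```python
def scroll_rows_by_identifier(inter, ids):
    """Scroll each row of the first intermediate result to the right.

    inter[i][p] is a value and ids[i][p] is its (row, col) identifier in
    the product, 0-based. All elements of one row need the same cyclic
    shift, so the shift is read off the element stored in column 0.
    """
    n = len(inter)
    out = []
    for i in range(n):
        target_col = ids[i][0][1]      # column where the first element belongs
        shift = target_col % n         # it currently sits in column 0
        out.append(inter[i][n - shift:] + inter[i][:n - shift])
    return out


if __name__ == "__main__":
    inter = [["C11", "C12", "C13"],
             ["C22", "C23", "C21"],
             ["C33", "C31", "C32"]]
    ids = [[(0, 0), (0, 1), (0, 2)],
           [(1, 1), (1, 2), (1, 0)],
           [(2, 2), (2, 0), (2, 1)]]
    for row in scroll_rows_by_identifier(inter, ids):
        print(row)
```

  • With identifiers available, the controller does not need to know the scrolling direction in advance; the shift distance follows from where each element says it belongs.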
  • For example, the first row is scrolled to the right by 0 steps, that is, it is not scrolled.
  • The second row is scrolled to the right by 1 step: C21 wraps around to the first column, C23 moves to the third column, and C22 moves to the second column.
  • In step S2-12, the second matrix can be scrolled in the column direction instead.
  • The process is similar to that of scrolling the transposed matrix, except that the processing in step S2-13 and the way the elements are scrolled differ slightly.
  • The derivation is not repeated here; refer to the process above.
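  • The steps of Example 2-1 can be sketched in plain Python as follows. This is a minimal illustration, not the disclosed hardware: nested lists stand in for the registers of an n×n array of processing elements, the function name `matmul_by_row_scrolling` and the variables `Bt` and `inter` are hypothetical, and the upward scrolling direction is assumed.

```python
def matmul_by_row_scrolling(A, B):
    """Multiply square matrices A (left) and B (right) by scrolling B^T upward."""
    n = len(A)
    # Step S2-11: transpose the right-multiplication matrix. PE(i, j) then
    # holds A[i][j] and Bt[i][j] = B[j][i] in its registers.
    Bt = [[B[j][i] for j in range(n)] for i in range(n)]
    # First intermediate result, filled one column per computation.
    inter = [[0] * n for _ in range(n)]
    for k in range(n):
        # Step S2-12: multiply the register pairs and sum along each PE row.
        for i in range(n):
            inter[i][k] = sum(A[i][j] * Bt[i][j] for j in range(n))
        # Scroll the transposed matrix up by one row; row 0 wraps to the bottom.
        # After the n-th scroll Bt is back at its initial position.
        Bt = Bt[1:] + Bt[:1]
    # Step S2-13: scroll row i (0-based) to the right by i steps, which is the
    # 1-based rule "row i by i-1 steps".
    return [row[n - i:] + row[:n - i] for i, row in enumerate(inter)]


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    naive = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
             for i in range(3)]
    print(matmul_by_row_scrolling(A, B) == naive)
```

  • Each matrix is read into the array once and only the transposed matrix moves afterwards, which is the property the disclosure relies on to reduce memory accesses.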
  • Example 2-2: the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix; that is, the left-multiplication matrix is transposed.
  • Assume that the first matrix a_mn and the second matrix b_nk are both 3×3 matrices and that the processing elements form a 4×4 array.
  • The transposed matrix of the first matrix is $\begin{pmatrix} A_{11} & A_{21} & A_{31} \\ A_{12} & A_{22} & A_{32} \\ A_{13} & A_{23} & A_{33} \end{pmatrix}$, and the second matrix is $\begin{pmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{pmatrix}$.
  • The second matrix is loaded into the registers of the processing elements.
  • For the loading method, refer to the loading method described in Example 2-1, which is not repeated here.
  • The transposed matrix is loaded into the processing elements in the same manner as the second matrix.
  • After loading, the rows of the transposed matrix of the first matrix are aligned with the rows of the second matrix.
  • For example, B11 is loaded into the register of PE11, B12 into PE12, B13 into PE13, B21 into PE21, ..., and B33 into PE33; that is, the subscript of each element is exactly the same as the subscript of the processing element that holds it.
  • Then A11 can be loaded into the register of PE11, A21 into PE12, A31 into PE13, A12 into PE21, A22 into PE22, A32 into PE23, ..., and A33 into PE33. That is, the transposed matrix is loaded into the registers of the processing elements in an order that is row-aligned with the other matrix (the second matrix).
  • In the row direction, the processing elements storing the first column of the transposed matrix can be connected with the processing elements storing its last column to form rings; data can flow around each ring, which allows scrolling in the row direction in units of columns.
  • For example, PE11 and PE13 can be connected to form a ring, PE21 and PE23 can be connected to form a ring, and PE31 and PE33 can be connected to form a ring.
  • The controller can control the processing elements to multiply the elements in the corresponding registers to obtain element products, and the element products in the same column are summed to obtain the first intermediate result.
  • For example, PE11 multiplies the elements A11 and B11 stored in its registers to obtain the element product A11×B11; similarly, A12×B21 and A13×B31 are obtained in PE21 and PE31. Summing the element products of the first column gives C11 = A11×B11 + A12×B21 + A13×B31, and C22 and C33 are obtained from the second and third columns in the same way.
  • C11, C22 and C33 may be temporarily stored in the buffer as the first row of the first intermediate result.
  • The transposed matrix can then be scrolled to the left by one column, with the elements of the first column wrapping around to the last column; it can also be scrolled to the right by one column, which is not limited in the present disclosure.
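  • After the remaining computations, with the transposed matrix scrolled to the left by one column each time, the first intermediate result stored in the buffer is $\begin{pmatrix} C_{11} & C_{22} & C_{33} \\ C_{21} & C_{32} & C_{13} \\ C_{31} & C_{12} & C_{23} \end{pmatrix}$.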
  • In step S2-13, for the case where the transposed matrix is scrolled to the left, the first intermediate result may be stored in rows, and the controller may scroll the i-th column of the first intermediate result downward in the column direction by i-1 steps to obtain the product of the input matrices.
  • For the case where the transposed matrix is scrolled to the right, the controller may store the first intermediate result in rows and scroll the i-th column of the first intermediate result upward in the column direction by i-1 steps to obtain the product of the input matrices. The specific steps are similar to those for scrolling to the left and are not repeated here.
  • The controller can also scroll the elements of the first intermediate result in the column direction (for example, up or down) according to the row and column identifiers of the first intermediate result to obtain the product of the input matrices.
  • The elements stored in the registers can all carry the row and column identifiers of the element within its matrix.
  • These identifiers are used to determine the row and column identifiers of the elements of the first intermediate result, so that the controller can scroll the elements of the first intermediate result in the column direction according to them to obtain the product of the input matrices.
  • For example, the first column is scrolled down by 0 steps, that is, it is not scrolled.
  • The second column is scrolled down by 1 step: C12 wraps around to the first row, C32 moves to the third row, and C22 moves to the second row. After the third column is likewise scrolled down by 2 steps, the result is $\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{pmatrix}$.
  • In step S2-12, the second matrix can be scrolled in the row direction instead.
  • The process is similar to that of scrolling the transposed matrix, except that the processing in step S2-13 and the way the elements are scrolled differ slightly. The derivation is not repeated here; refer to the process above.
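  • Example 2-2 can be sketched in the same style. Again this is only an illustration under assumptions: nested lists stand in for the registers of the processing elements, the name `matmul_by_column_scrolling` and the variables `At` and `inter` are hypothetical, and the transposed matrix is assumed to scroll to the left.

```python
def matmul_by_column_scrolling(A, B):
    """Multiply square matrices A (left) and B (right) by scrolling A^T leftward."""
    n = len(A)
    # Step S2-11: transpose the left-multiplication matrix. PE(i, j) then
    # holds At[i][j] = A[j][i] and B[i][j] in its registers.
    At = [[A[j][i] for j in range(n)] for i in range(n)]
    inter = []  # first intermediate result, one row per computation
    for _ in range(n):
        # Step S2-12: multiply the register pairs and sum along each PE column.
        inter.append([sum(At[i][j] * B[i][j] for i in range(n))
                      for j in range(n)])
        # Scroll the transposed matrix left by one column; column 0 wraps
        # around to the last column.
        At = [row[1:] + row[:1] for row in At]
    # Step S2-13: scroll column j (0-based) down by j steps, which is the
    # 1-based rule "column i by i-1 steps".
    return [[inter[(i - j) % n][j] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    naive = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
             for i in range(3)]
    print(matmul_by_column_scrolling(A, B) == naive)
```

  • Compared with Example 2-1, the roles of rows and columns are swapped throughout: sums run down columns, results are stored by rows, and the final scroll runs along columns.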
  • The matrix multiplication method according to the foregoing embodiments of the present disclosure is well suited to a processor composed of processing elements arranged in an array.
  • The result of the matrix multiplication can be obtained and, compared with matrix multiplication in the related art, the number of memory accesses is reduced, the bandwidth pressure is relieved, and the efficiency of the operation is improved.
  • The result of matrix multiplication can be obtained directly according to the above examples.
  • When the input matrices are divided into blocks, the result of multiplying a first matrix by the corresponding second matrix according to the matrix multiplication rule is used as a second intermediate result; that is, each first matrix and second matrix obtained by block division is treated as one element of a matrix when carrying out the matrix multiplication, the second intermediate results are obtained, and the product of the input matrices is calculated from the second intermediate results.
  • Fig. 2-5 shows a schematic diagram of block division according to an embodiment of the present disclosure.
  • The controller can divide the matrices D and E into blocks in the manner described above to obtain the first matrices D11, D12, D21, D22 and the second matrices E11, E12, E21, E22.
  • The controller may treat each first matrix and each second matrix as one element of a matrix when carrying out the matrix multiplication.
  • To do so, the second intermediate results need to be obtained first.
  • Each second intermediate result can be obtained by operating on the corresponding first matrix and second matrix according to the process of steps S2-11 to S2-13.
  • The second intermediate results are thus obtained by dividing the input matrices into blocks and performing the matrix multiplication of the present disclosure on the divided matrices, and the product of the input matrices can be calculated from the second intermediate results. According to the operation method of the foregoing embodiment of the present disclosure, matrix multiplication can be realized quickly for matrices of any dimension.
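  • For the 2×2 block division of D and E above, the relationship between the second intermediate results and the final product follows the usual block matrix multiplication rule. A sketch, assuming D is the left-multiplication matrix and E the right-multiplication matrix; the name C for the product is introduced here for illustration:

```latex
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix}
D_{11}E_{11} + D_{12}E_{21} & D_{11}E_{12} + D_{12}E_{22} \\
D_{21}E_{11} + D_{22}E_{21} & D_{21}E_{12} + D_{22}E_{22}
\end{pmatrix}
```

  • Under this reading, each of the eight block products is one second intermediate result, and each block of the product is the sum of two of them.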
  • The divided first matrices and second matrices may be stored in the processing elements one after another for calculation, or may be stacked and stored in the processing elements.
  • The operation method of the present disclosure is illustrated below with the processing elements forming a 2×2 array and the input matrices being 4×4 matrices.
  • Fig. 2-6 shows an example of matrix division according to an embodiment of the present disclosure.
  • The controller can divide both the left-multiplication matrix and the right-multiplication matrix into 2×2 sub-matrices. Dividing the left-multiplication matrix gives four matrices a11, a12, a21 and a22, and dividing the right-multiplication matrix gives four matrices b11, b12, b21 and b22.
  • the input matrix can also be stored in the register of the processing element in a stacked storage manner to implement the multiplication of the input matrix.
  • the controller can divide the registers in the processing element into multiple different groups, and each group stores a divided first matrix and a corresponding second matrix.
  • the grouping method is not limited, but each of the registers in the same group can be located in a different processing element.
  • One possible calculation method is to scroll the matrix in units of the first matrices and second matrices obtained by block division, and to calculate each second intermediate result using the process of steps S2-11 to S2-13.
  • The first matrices can be obtained by dividing the left-multiplication matrix or by dividing the right-multiplication matrix.
  • The calculation method is illustrated here for the case where the first matrices are obtained by dividing the right-multiplication matrix: the second matrices are loaded directly, and the corresponding first matrices are transposed and then loaded.
  • The loading results are shown in Table 2-1 and Table 2-2.
  • Reg0, Reg1, Reg2 and Reg3 each represent one group of registers in the processing elements.
  • The processing elements form a 2×2 array.
  • Each processing element includes multiple registers.
  • The controller can divide the registers into multiple groups, 4 groups in this embodiment; the registers of one group are used to store one transposed matrix and the corresponding second matrix.
  • Reg0 stores a11 and b11, Reg1 stores a12 and b21, Reg2 stores a21 and b12, and Reg3 stores a22 and b22; that is, the blocks in the first row of the left-multiplication matrix are paired with the blocks in the first column of the right-multiplication matrix, and the blocks in the second row with the blocks in the second column.
  • The processing elements can calculate the second intermediate results a11×b11, a12×b21, a21×b12 and a22×b22 according to the process of steps S2-11 to S2-13.
  • The specific process is not repeated.
  • The transposed matrices can then be scrolled in units of groups. Specifically, scrolling the transposed matrices up by one row moves the elements of the transposed matrix in Reg2 to Reg0, those in Reg0 to Reg2, those in Reg3 to Reg1, and those in Reg1 to Reg3, which gives Table 2-3.
  • The processing elements can then calculate the second intermediate results a11×b12, a12×b22, a21×b11 and a22×b21 according to the process of steps S2-11 to S2-13.
  • The specific process is not repeated.
  • the product of the input matrix can be calculated in a block-wise manner.
  • the matrix multiplication operation method according to the present disclosure can realize matrix operations of any size.
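  • The block-wise procedure with four register groups can be sketched as follows for 4×4 inputs and 2×2 blocks. This is an illustration under assumptions: the helper names `block`, `mat_mul`, `mat_add` and `blockwise_product` are hypothetical, the groups are laid out as Reg0 Reg1 over Reg2 Reg3, and a plain block product stands in for running steps S2-11 to S2-13 on each pair of blocks.

```python
def block(M, r, c, h=2):
    """Return the h x h block of M at block-row r, block-column c."""
    return [row[c * h:(c + 1) * h] for row in M[r * h:(r + 1) * h]]


def mat_mul(X, Y):
    """Plain product of two small blocks (stand-in for steps S2-11 to S2-13)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def mat_add(X, Y):
    """Element-wise sum of two blocks of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def blockwise_product(A, B):
    """Multiply 4x4 matrices A (left) and B (right) block by block."""
    # Group (r, c) holds block a[r][c] of the left-multiplication matrix.
    a = [[block(A, r, c) for c in range(2)] for r in range(2)]
    # Block-level transpose of B: group (r, c) holds block b[c][r], so
    # Reg0 pairs a11 with b11, Reg1 a12 with b21, Reg2 a21 with b12,
    # and Reg3 a22 with b22.
    bt = [[block(B, c, r) for c in range(2)] for r in range(2)]
    inter = [[None, None], [None, None]]
    for k in range(2):
        for r in range(2):
            # Sum the second intermediate results along one row of groups.
            inter[r][k] = mat_add(mat_mul(a[r][0], bt[r][0]),
                                  mat_mul(a[r][1], bt[r][1]))
        # Scroll the transposed blocks up by one row of groups:
        # Reg2 -> Reg0, Reg0 -> Reg2, Reg3 -> Reg1, Reg1 -> Reg3.
        bt = bt[1:] + bt[:1]
    # Scroll block-row r to the right by r steps, then stitch the blocks.
    blocks = [row[2 - r:] + row[:2 - r] for r, row in enumerate(inter)]
    return [blocks[r][0][i] + blocks[r][1][i]
            for r in range(2) for i in range(2)]
```

  • The first pass yields the diagonal blocks of the product and the second pass, after one scroll, yields the off-diagonal blocks, mirroring the two sets of second intermediate results listed above.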
  • Example 2-4 Stacked storage combined with overall scrolling
  • With stacked storage, step S2-12 in Fig. 2-3 can be implemented as follows. Before each scroll of the matrix in the row or column direction, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row (or, in the example where the left-multiplication matrix is transposed, of the same column) are summed to obtain the first intermediate results C11, C22, C33, C44.
  • With stacked storage, what was originally one row or one column of data is split across different groups of registers and stored as at least two independent rows or columns. The first element of the later segment and the last element of the earlier segment were adjacent before stacking but are no longer stored contiguously afterwards. Therefore, after the elements in each group of registers are scrolled once in the row or column direction, the scrolling result needs to be corrected to get the correct result.
  • Depending on the scrolling direction, the specific correction method can be one of the following:
  • when scrolling to the left by one column, move the last column of each scrolled block to the last column of the adjacent previous block;
  • when scrolling to the right by one column, move the first column of each scrolled block to the first column of the adjacent next block;
  • when scrolling up by one row, move the last row of each scrolled block to the last row of the adjacent previous block;
  • when scrolling down by one row, move the first row of each scrolled block to the first row of the adjacent next block.
  • Each block mentioned above refers to one transposed matrix, that is, a matrix obtained by transposing one of the matrices produced by block division.
  • In this example the right-multiplication matrix is transposed, and scrolling is still performed in units of rows. Because of the stacked storage, however, rows that should be consecutive are held in at least two groups, and within each group the rows are treated as independent. Scrolling only within each group of registers therefore does not give the correct result, and a correction is needed.
  • Take Table 2-2 as an example: within each group of registers, the transposed matrix is scrolled up by one row, and the result is shown in Table 2-4.
  • In Table 2-4, the elements of the first row of each group of registers have been scrolled to the last row of the same group. According to Table 2-2, however, the first-row elements of Reg0 and Reg1 should scroll to the last row of Reg2 and Reg3, but they are now in the last row of Reg0 and Reg1; likewise, the first-row elements of Reg2 and Reg3 should scroll to the last row of Reg0 and Reg1, but they are now in the last row of Reg2 and Reg3. In other words, the last-row elements of Reg0 and Reg1 in Table 2-4 belong in the last row of Reg2 and Reg3, and the last-row elements of Reg2 and Reg3 belong in the last row of Reg0 and Reg1. The last-row elements of Reg2 and Reg0 should therefore be exchanged, and likewise the last-row elements of Reg3 and Reg1.
  • After the correction, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row are summed to obtain the first intermediate results C12, C23, C34, C41.
  • Repeating this process, with 4 computations and 3 scrolls in total, completes the matrix multiplication, and the product of the input matrices can be obtained from the first intermediate result.
  • The stacked storage can follow the block division described above, but it is not limited to each register storing one matrix element, nor to the numbers of rows and columns of the multiplied matrices being integer multiples of the numbers of rows and columns of the processing element array, and the stacking method described is not the only one.
  • The correction process is the same in each case: it only needs to ensure that, after correction, the elements of each original row or column are again connected in sequence.
  • The specific stacked storage process is not limited here.
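  • The scroll-then-correct step for stacked storage can be sketched as follows for the upward direction. This is an illustration under simplifying assumptions: each entry of `groups` models one vertically stacked block as a list of rows (for instance Reg0 and Reg1 side by side as one entry and Reg2 and Reg3 as the next), and the name `scroll_up_stacked` is hypothetical.

```python
def scroll_up_stacked(groups):
    """Scroll stacked blocks up by one row and apply the correction.

    groups[g] is the list of rows held by the g-th block, counted from the
    top of the original matrix. The input is left unmodified.
    """
    # Scroll every block up by one row on its own; the first row of each
    # block wraps around to the last row of the same block.
    scrolled = [rows[1:] + rows[:1] for rows in groups]
    # Correction: the last row of each block moves to the last row of the
    # adjacent previous block (cyclically), which reconnects the rows that
    # were consecutive before stacking.
    last_rows = [rows[-1] for rows in scrolled]
    for g in range(len(scrolled)):
        scrolled[g][-1] = last_rows[(g + 1) % len(scrolled)]
    return scrolled


if __name__ == "__main__":
    full = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    top, bottom = scroll_up_stacked([full[:2], full[2:]])
    # The corrected blocks together equal the unstacked matrix scrolled up once.
    print(top + bottom == full[1:] + full[:1])
```

  • With two blocks the correction reduces to exchanging the last rows of the upper and lower groups, which matches the exchange between Reg0 and Reg2 and between Reg1 and Reg3 described above.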
  • It should be noted that although the steps in the flowchart are displayed in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless specifically stated herein, the execution order of these steps is not strictly limited, and they can be executed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time and can be executed at different times; nor are they necessarily executed one after another, but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
  • The present disclosure also provides an arithmetic device for matrix multiplication based on a matrix of processing elements, and the arithmetic device can be applied to a processor.
  • Figure 2-1 shows an example of a processor.
  • The processor may include two or more processing elements arranged in a two-dimensional matrix, each processing element includes at least one register, and the arithmetic device is used to implement matrix multiplication of the first matrix and the second matrix.
  • The foregoing device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • For example, the division of units/modules in the above embodiments is only a division by logical function, and other divisions are possible in actual implementation.
  • For example, multiple units, modules, or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • The above integrated unit/module can be implemented in the form of hardware or in the form of a software program module.
  • When implemented in hardware, the hardware may be a digital circuit, an analog circuit, and so on.
  • The physical realization of the hardware structure includes but is not limited to transistors, memristors, and so on.
  • The register can be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random-access memory (RRAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random-access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and so on.
  • If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • The technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The aforementioned memory includes a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store program code.
  • the embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored, and the computer program instructions implement the above-mentioned method when executed by a processor.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the embodiment of the present disclosure also provides an artificial intelligence chip, which includes the processor as described above.
  • A board card is also provided, which includes a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip, wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to realize data transmission between the artificial intelligence chip and an external device; and the control device is used to monitor the state of the artificial intelligence chip.
  • A processor includes two or more processing elements arranged in a two-dimensional matrix, each processing element includes at least one register, and the processor is configured to perform a matrix multiplication operation on a first matrix and a second matrix,
  • the processor further includes a controller configured to load the elements of the transposed matrix of the first matrix and the elements of the second matrix into the registers of the processing elements, wherein the elements at corresponding positions of the transposed matrix and the second matrix are stored in the registers of the same processing element;
  • the controller is used to control the transposed matrix or the second matrix to scroll in the row direction or the column direction, to control the processing elements to multiply the elements in the corresponding registers to obtain the element products, and to sum the element products of the same row or the same column to obtain the first intermediate result;
  • the controller is further configured to process the first intermediate result to obtain the product of the first matrix and the second matrix.
  • the controller controls the processing elements and the transposed matrix and second matrix stored in the registers to repeat the following process until the elements in the transposed matrix or the second matrix return to their positions before scrolling:
  • the controller controls the processing elements to multiply the elements in the corresponding registers to obtain the element products, sums the element products in the same row or the same column to obtain the first intermediate result, and controls the transposed matrix or the second matrix stored in the registers to scroll by one row or one column in the row direction or the column direction.
  • the controller controls the elements in the transposed matrix to scroll in the row direction, or controls the elements in the second matrix to scroll in the row direction, controls the processing elements to multiply the elements in the corresponding registers to obtain the element products, and sums the element products of the same column to obtain the first intermediate result;
  • the controller controls the elements in the transposed matrix to scroll in the column direction, or controls the elements in the second matrix to scroll in the column direction, controls the processing elements to multiply the elements in the corresponding registers to obtain the element products, and sums the element products of the same row to obtain the first intermediate result.
  • the controller stores the first intermediate result in rows or columns, and obtains the product of the first matrix and the second matrix after scrolling in the row direction or the column direction.
  • Clause B5. The processor according to any one of clauses B1-B4, wherein the controller is further configured to determine whether to block the input matrix according to the arrangement of the processing elements and the row rank and column rank of the input matrix, wherein the input matrix includes a left multiplication matrix and a right multiplication matrix;
  • the controller splits the rows of the left multiplication matrix or the columns of the right multiplication matrix according to the arrangement of the processing elements;
  • the controller blocks the column direction of the left multiplication matrix and the row direction of the right multiplication matrix in the same way according to the arrangement of the processing elements and the row rank and column rank of the input matrix.
  • wherein two or more first matrices are obtained after the left multiplication matrix is divided into blocks and two or more second matrices are obtained after the right multiplication matrix is divided into blocks, or two or more second matrices are obtained after the left multiplication matrix is divided into blocks and two or more first matrices are obtained after the right multiplication matrix is divided into blocks.
  • the controller is further configured to calculate the product of the left multiplication matrix and the right multiplication matrix according to the product of the first matrix and the second matrix.
  • Clause B7 The processor of clause B5, the processor comprising multiple sets of registers,
  • the controller is further configured to transpose two or more of the first matrices to obtain a transposed matrix after dividing the input matrix into blocks;
  • the controller loads the transposed matrix and two or more of the second matrices into the multiple sets of registers for stack storage, and a set of registers stores the transposed matrix and the second matrix at corresponding positions;
  • the controller controls the processing elements to multiply the elements in the corresponding registers to obtain the element products, and sums the element products of the same row or the same column to obtain the first intermediate result;
  • after controlling the elements of the transposed matrix in a set of registers to scroll by one row or one column in the row direction or the column direction, the controller also corrects the scrolling result.
  • correcting the scrolling result includes:
  • the correction method is to scroll the last column of data of each block of the transposed matrix, after scrolling, to the last column of the adjacent previous block of the transposed matrix;
  • the correction method is to scroll the first column of data of each block of the transposed matrix, after scrolling, to the first column of the adjacent next block of the transposed matrix;
  • the correction method is to scroll the last row of data of each block of the transposed matrix, after scrolling, to the last row of the adjacent previous block of the transposed matrix;
  • the correction method is to scroll the first row of data of each block of the transposed matrix, after scrolling, to the first row of the adjacent next block of the transposed matrix;
  • each block of the transposed matrix refers to the matrix obtained by transposing each block of the matrix.
  • transposing the first matrix to obtain a transposed matrix, and loading the elements of the transposed matrix and the second matrix into the registers of the processing elements, wherein the elements at corresponding positions of the transposed matrix and the second matrix are stored in the registers of the same processing element;
  • the first intermediate result is processed to obtain the product of the first matrix and the second matrix.
  • Clause B10. The operation method according to clause B9, wherein controlling the transposed matrix or the second matrix to scroll in the row direction or the column direction, controlling the processing elements to multiply the elements in the corresponding registers to obtain the element products, and summing the element products in the same row or the same column to obtain the first intermediate result includes repeating the following process until the elements in the transposed matrix or the second matrix return to their positions before scrolling:
  • the processing elements are controlled to multiply the elements in the corresponding registers to obtain the element products, the element products in the same row or the same column are summed to obtain the first intermediate result, and the transposed matrix or the second matrix is scrolled by one row or one column in the row direction or the column direction.
  • when the first matrix is a left-multiplying matrix and the second matrix is a right-multiplying matrix, the elements in the transposed matrix are controlled to scroll in the row direction, or the elements in the second matrix are controlled to scroll in the row direction; the processing elements are controlled to multiply the elements in the corresponding registers to obtain the element products, and the element products of the same column are summed to obtain the first intermediate result;
  • when the first matrix is a right-multiplying matrix and the second matrix is a left-multiplying matrix, the elements in the transposed matrix are controlled to scroll in the column direction, or the elements in the second matrix are controlled to scroll in the column direction; the processing elements are controlled to multiply the elements in the corresponding registers to obtain the element products, and the element products of the same row are summed to obtain the first intermediate result.
  • processing the first intermediate result to obtain the product of the first matrix and the second matrix includes:
  • the first intermediate result is stored in rows or columns, and the product of the first matrix and the second matrix is obtained after scrolling in the row direction or the column direction.
  • Clause B13 The method according to any one of clauses B9-B12, the method further comprising:
  • wherein two or more first matrices are obtained after the left multiplication matrix is divided into blocks and two or more second matrices are obtained after the right multiplication matrix is divided into blocks, or two or more second matrices are obtained after the left multiplication matrix is divided into blocks and two or more first matrices are obtained after the right multiplication matrix is divided into blocks.
  • Clause B14 The method according to Clause B13, the method further comprising:
  • the product of the left multiplication matrix and the right multiplication matrix is calculated according to the product of the first matrix and the second matrix.
  • the method also includes:
  • the processing elements are controlled to multiply the elements in the corresponding registers to obtain the element products, and the element products in the same row or the same column are summed to obtain the first intermediate result;
  • the scrolling result is corrected.
  • the correction method is to scroll the last column of data of each block of the transposed matrix, after scrolling, to the last column of the adjacent previous block of the transposed matrix;
  • the correction method is to scroll the first column of data of each block of the transposed matrix, after scrolling, to the first column of the adjacent next block of the transposed matrix;
  • the correction method is to scroll the last row of data of each block of the transposed matrix, after scrolling, to the last row of the adjacent previous block of the transposed matrix;
  • the correction method is to scroll the first row of data of each block of the transposed matrix, after scrolling, to the first row of the adjacent next block of the transposed matrix;
  • each block of the transposed matrix refers to the matrix obtained by transposing each block of the matrix.
  • Clause B17 An artificial intelligence chip comprising the processor according to any one of clauses B1-B8.
  • Clause B18 An electronic device including the artificial intelligence chip as described in Clause B17.
  • Matrix operations account for a large share of the computation when artificial intelligence is used to process information. Existing processors decompose a matrix operation into multiplication and addition operations, which requires frequently reading data from the memory and is very inefficient.
  • Multi-stage pipelines are usually used to implement the operation process, with each stage processing part of the input data.
  • Data therefore needs to be read from the memory frequently, and frequent access to the memory leads to higher bandwidth requirements.
  • the present disclosure provides an operation method and a processor for executing the operation method.
  • the processor may include multiple processing elements.
  • the multiple processing elements may be arranged in a two-dimensional matrix to better adapt to matrix operations.
  • Figure 3-1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
  • the processor includes multiple processing elements PE (Processing Element) arranged in a two-dimensional matrix, and each processing element is connected to adjacent processing elements.
  • Each PE can be provided with at least one register (not shown in the figure).
  • the processor can load the elements of the matrix into the register corresponding to each PE, and the processor can control the PE to perform operations on the elements stored in the register set in the PE.
  • the processor may also include a controller and a memory, where both the controller and the memory are connected to multiple processing elements, and the controller may be connected to the memory.
  • the controller is used to load input data from the memory to the register of the processing element, and control the processing element to process the input data.
  • The memory may store the first matrix and the second matrix (or the left multiplication matrix and the right multiplication matrix). The processor is used to perform matrix multiplication operations on the first matrix and the second matrix. Therefore, the controller can load the first matrix and the second matrix into the registers of the processing elements and control the processing elements to perform the matrix multiplication operation.
  • An executable program may also be stored in the memory. The executable program may include instructions, and executing the instructions may implement matrix multiplication operations on the first matrix and the second matrix.
  • The controller can be provided with a loader, a decoder, and so on. The loader can be used to load the input data in the memory into the registers of the processing elements, and the decoder can decode the instructions of the executable program according to the storage address of the input data after loading.
  • For example, for an instruction that accesses data, the register address of the input data obtained by decoding is assigned to that instruction, and the decoded instruction is sent to the processing element.
  • The instruction is executed by the processing element, thereby realizing the processing of the data, for example, the matrix multiplication operation of the first matrix and the second matrix.
  • The memory may be an on-chip cache. The controller may load the executable program and the input data (for example, the input matrix, including the left multiplication matrix and the right multiplication matrix) from off-chip flash memory into the above-mentioned memory (the on-chip cache) and then perform the subsequent matrix multiplication process.
  • the controller can also directly load the input matrix and the executable program from the off-chip memory to the register of the processing element, which is not limited in the present disclosure.
  • the PE may also include an arithmetic unit to complete the specified operation. Taking matrix operation as an example, the PE may include, for example, a multiplier, an adder, etc.
  • The specific structure of each PE may be the same or different, which is not limited in the present disclosure.
  • the PE may also include other types of arithmetic units to adapt to various different arithmetic processes. The present disclosure does not limit the number and types of arithmetic units included in the PE.
  • The processor can also preprocess the input data to obtain preprocessed input data, load the preprocessed input data into the registers of the processing elements, and control the processing elements to perform operations on the preprocessed input data.
  • the input matrix of the multiplication operation may include a left multiplication matrix and a right multiplication matrix, where the left multiplication matrix may refer to the matrix located on the left side of the multiplication sign, and the right multiplication matrix may refer to the matrix located on the right side of the multiplication sign.
  • The controller can first determine whether to block the input matrix according to the arrangement of the processing elements and the row rank and column rank of the input matrix. Operations are then performed on each block of the matrix to obtain the first intermediate result, and the controller may control the processing elements to calculate the product of the input matrix according to the first intermediate result.
  • the arrangement of the processing elements may refer to the number of rows and columns of the processing elements, and the row rank and column rank of the input matrix may refer to the number of rows and columns of the left multiplication matrix and the right multiplication matrix.
  • Determining whether to block the input matrix according to the arrangement of the processing elements and the row rank and column rank of the input matrix may mean that the controller determines whether the number of rows of the input matrix is greater than the number of rows of the processing elements and whether the number of columns of the input matrix is greater than the number of columns of the processing elements, and determines whether to block the input matrix according to the result of the judgment.
  • If neither number exceeds the corresponding number of the processing elements, the controller may not block the input matrix.
  • Otherwise, the controller may block the input matrix.
  • For example, suppose the array of processing elements is an M×N matrix, which can be expressed as $PE_{MN}$, one input matrix is an m×n matrix, which can be expressed as $A_{mn}$, and the other input matrix is an n×k matrix, which can be expressed as $B_{nk}$.
  • If the number of rows m of $A_{mn}$ is not greater than the number of rows M of the processing elements, the number of columns n is not greater than the number of columns N of the processing elements, and the number of rows n of $B_{nk}$ is not greater than the number of rows M of the processing elements, the controller may not block the input matrix.
  • Otherwise, the controller can block the input matrix.
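  • The blocking decision can be sketched in Python as follows. This is an illustration only: the function name and the tuple-based shapes are hypothetical, and the check on the number of columns of the second matrix follows the general condition stated in this disclosure that both matrices must fit within the rows and columns of the processing elements.

```python
def needs_blocking(pe_rows, pe_cols, a_shape, b_shape):
    """Return True if either input matrix exceeds the processing element array.

    pe_rows, pe_cols: M and N of the processing element array.
    a_shape: (m, n) of the left multiplication matrix.
    b_shape: (n, k) of the right multiplication matrix.
    """
    m, n = a_shape
    n_b, k = b_shape
    if n != n_b:
        raise ValueError("inner dimensions of the two matrices must match")

    def fits(rows, cols):
        # A matrix fits when neither dimension exceeds the PE array.
        return rows <= pe_rows and cols <= pe_cols

    return not (fits(m, n) and fits(n_b, k))
```

  • For instance, with a 2×2 array of processing elements, a 3×2 left multiplication matrix triggers blocking, while with a 4×4 array a 2×4 matrix multiplied by a 4×3 matrix does not.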
  • If the input matrix is to be divided into blocks, it is assumed that two or more first matrices are obtained after the left multiplication matrix is divided into blocks, and two or more second matrices are obtained after the right multiplication matrix is divided into blocks.
  • If the number of rows of the left multiplication matrix is greater than the number of rows of the processing elements, the controller can determine to block the left multiplication matrix in the input matrix; if the number of columns of the right multiplication matrix is greater than the number of columns of the processing elements, the controller can determine to block the right multiplication matrix. To block the left multiplication matrix, the controller can split the rows of the left multiplication matrix according to the arrangement of the processing elements. To block the right multiplication matrix, the controller can split the columns of the right multiplication matrix according to the arrangement of the processing elements.
  • The controller can also block both matrices in the input matrix.
  • The controller may need to block both matrices. To block both matrices in the input matrix, the controller can block the column direction of the left multiplication matrix and the row direction of the right multiplication matrix in the same way, according to the arrangement of the processing elements and the row rank and column rank of the input matrix.
  • For example, if the left multiplication matrix is $A_{32}$ and the right multiplication matrix is $B_{22}$, the left multiplication matrix $A_{32}$ can be split into matrix $A_{12}$ and matrix $A_{22}$, each of which is multiplied by the right multiplication matrix $B_{22}$.
  • As another example, if the left multiplication matrix is $A_{22}$ and the right multiplication matrix is $B_{23}$, the right multiplication matrix $B_{23}$ can be split into a matrix $B_{21}$ and a matrix $B_{22}$.
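  • The row split in the first example can be sketched as follows: each row block of the left multiplication matrix is multiplied by the whole right multiplication matrix, and the partial results are stacked in order. The helper names are hypothetical and plain nested lists stand in for the registers; splitting the columns of the right multiplication matrix works symmetrically.

```python
def matmul(a, b):
    """Reference product of two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_rows(a, sizes):
    """Split matrix a into consecutive row blocks with the given heights."""
    blocks, start = [], 0
    for size in sizes:
        blocks.append(a[start:start + size])
        start += size
    return blocks


def multiply_by_row_blocks(a, b, sizes):
    """Multiply a by b one row block of a at a time and stack the results."""
    if sum(sizes) != len(a):
        raise ValueError("block heights must add up to the number of rows of a")
    product = []
    for block in split_rows(a, sizes):
        product.extend(matmul(block, b))
    return product


# A 3x2 left matrix split into a 1x2 block and a 2x2 block, as in the example.
a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 0], [2, 1]]
result = multiply_by_row_blocks(a, b, [1, 2])
```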
  • The controller may block in the same manner in the column direction of the left multiplication matrix and in the row direction of the right multiplication matrix, where blocking in the same manner means that the number of columns of each first matrix obtained after division is the same as the number of rows of the corresponding second matrix, so as to ensure that the matrix operation can be completed normally.
  • The column direction of the left multiplication matrix and the row direction of the right multiplication matrix are divided in the same way until the condition that no further blocking is required is met, that is, the numbers of rows of the first matrix and the second matrix are not greater than the number of rows of the processing elements, and their numbers of columns are not greater than the number of columns of the processing elements.
  • The division can be performed so that the row rank and column rank of each divided first matrix or second matrix are as close as possible to the numbers of rows and columns of the processing elements, which can improve the efficiency of the operation and shorten the operation time. That is to say, assuming that the processing elements form a 4×4 array, the matrices can first be divided so that the divided matrices are 4×4, so that the processing elements are used with maximum efficiency and the calculation efficiency is improved.
  • Figures 3-2a and 3-2b show two different ways of dividing.
  • In both, the matrix $A_{24}$ is divided into blocks in the column direction and the matrix $B_{43}$ is divided into blocks in the row direction in the same manner.
  • Figure 3-2a is one example of the division.
  • Matrix $A_{24}$ is divided into two parts in the column direction, each part including two columns, and matrix $B_{43}$ is divided into two parts in the row direction, each part including two rows, as shown in Figure 3-2a.
  • Figure 3-2b is another example of the division.
  • Matrix $A_{24}$ is divided into three parts in the column direction.
  • Matrix $B_{43}$ is divided into three parts in the row direction, one part including two rows and the other two parts each including one row.
  • the above arrangement of processing elements and the division of the input matrix are merely an example of the present disclosure, and do not limit the present disclosure in any way.
  • The present disclosure does not specifically restrict the division of the row direction of the left multiplication matrix or the column direction of the right multiplication matrix, as long as the divided matrices meet the condition that no further blocking is required.
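  • When the shared dimension is blocked in the same manner, the product of the input matrices is the sum of the products of corresponding blocks. The sketch below illustrates this for a 2×4 matrix times a 4×3 matrix with the two partitions described for Figures 3-2a and 3-2b; the function names and sample values are hypothetical.

```python
def matmul(a, b):
    """Reference product of two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_shared_blocks(a, b, sizes):
    """Block the columns of a and the rows of b identically, then sum the partial products."""
    if sum(sizes) != len(b) or len(b) != len(a[0]):
        raise ValueError("block sizes must cover the shared dimension")
    total = [[0] * len(b[0]) for _ in a]
    start = 0
    for size in sizes:
        a_block = [row[start:start + size] for row in a]  # columns start .. start+size-1 of a
        b_block = b[start:start + size]                   # the matching rows of b
        partial = matmul(a_block, b_block)
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, partial)]
        start += size
    return total


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0, 2], [0, 1, 0], [3, 0, 1], [0, 2, 0]]
two_parts = matmul_shared_blocks(a, b, [2, 2])       # division as in Figure 3-2a
three_parts = matmul_shared_blocks(a, b, [2, 1, 1])  # division as in Figure 3-2b
```

  • Both partitions give the same result as multiplying the unblocked matrices, which is why the disclosure leaves the exact division open.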
  • Fig. 3-3 shows a flowchart of an operation method according to an embodiment of the present disclosure.
  • the controller can also directly use the left multiplication matrix as the first matrix and the right multiplication matrix as the second matrix.
  • the method shown in FIG. 3-3 may be executed by the controller in the processor or executed by the processing element controlled by the controller.
  • the calculation method provided by the present disclosure may include the following steps:
  • Step S3-31, preprocessing the first matrix and the second matrix to obtain the third matrix and the fourth matrix, wherein the elements at corresponding positions of the third matrix and the fourth matrix are stored in the register of the same processing element,
  • where the column rank of the first matrix and the row rank of the second matrix are both k, and max(m, k, n) denotes the maximum of m, k, and n;
  • Step S3-32 scroll the third matrix and the fourth matrix in the row direction or the column direction, and control the processing element to perform multiplication operations on the elements in the corresponding registers to obtain the element product matrix;
  • Step S3-33 processing the element product matrix according to the way of preprocessing the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • the preprocessing can include: first preprocessing and second preprocessing.
  • the first preprocessing can refer to expanding the first matrix and the second matrix, and the second preprocessing can refer to rolling elements in the expanded matrix. .
  • The controller can use 0 to expand the first matrix and the second matrix. Specifically, assuming that the first matrix is m×k and the second matrix is k×n, the controller can determine the maximum value p of m, k, and n, and then expand the first matrix and the second matrix with 0 on the lower side and/or right side so that each forms a p×p matrix.
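  • A minimal sketch of this zero expansion, with hypothetical helper names and nested lists in place of registers, could look like this:

```python
def pad_to_square(mat, p):
    """Append zeros on the right and lower sides so that mat becomes p x p."""
    padded = [list(row) + [0] * (p - len(row)) for row in mat]
    padded += [[0] * p for _ in range(p - len(mat))]
    return padded


def expand_inputs(first, second):
    """Expand an m x k matrix and a k x n matrix to p x p, with p = max(m, k, n)."""
    m, k, n = len(first), len(second), len(second[0])
    if any(len(row) != k for row in first):
        raise ValueError("columns of the first matrix must equal rows of the second")
    p = max(m, k, n)
    return pad_to_square(first, p), pad_to_square(second, p), p


first = [[1, 2, 3], [4, 5, 6]]       # m = 2, k = 3
second = [[1, 2], [3, 4], [5, 6]]    # k = 3, n = 2
expanded_first, expanded_second, p = expand_inputs(first, second)
```

  • Because the added elements are all 0, they contribute nothing to any element product, so the upper-left m×n part of the later result is unaffected.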
  • step S3-32 may include the following process:
  • Step S3-321 the controller controls the processing element to perform multiplication operations on the elements in the corresponding registers to obtain the first element product matrix
  • Step S3-322, the controller repeats the following process (p-1) times: scroll the third matrix as a whole by one step to the left and scroll the fourth matrix as a whole by one step up, or scroll the third matrix as a whole by one step to the right and scroll the fourth matrix as a whole by one step down, and control the processing elements to perform multiplication operations on the elements in the corresponding registers to obtain a second element product matrix.
  • In other words, the controller may first control the processing elements to multiply the elements in the corresponding registers to obtain the first element product matrix.
  • The controller can then repeat the following process p-1 times: scroll the third matrix as a whole by one step to the left, scroll the fourth matrix as a whole by one step up, and control the processing elements to multiply the elements in the corresponding registers to obtain a second element product matrix; or repeat the following process p-1 times: scroll the third matrix as a whole by one step to the right, scroll the fourth matrix as a whole by one step down, and control the processing elements to multiply the elements in the corresponding registers to obtain a second element product matrix.
  • In this way, the controller can control the processing elements to calculate p-1 second element product matrices.
  • For the case in step S3-322 where the third matrix is scrolled to the left and the fourth matrix is scrolled up by one step each time, the corresponding second preprocessing process can be "scroll the i-th row of the expanded first matrix to the left by i steps, and scroll the j-th column of the expanded second matrix up by j steps, where i and j are natural numbers, 0 ≤ i ≤ p-1, and 0 ≤ j ≤ p-1"; and for the case in step S3-322 where the third matrix is scrolled to the right by one step and the fourth matrix is scrolled down by one step each time, the corresponding second preprocessing process can be "scroll the i-th row of the expanded first matrix to the left by i steps and then to the right by 1 step, and scroll the j-th column of the expanded second matrix up by j steps and then down by 1 step", or "scroll the i-th row of the expanded first matrix to the left by i-1 steps, and scroll the j-th column of the
  • A closed loop can be formed between the processing elements that store the elements of the matrix. Since adjacent processing elements are connected together, the controller can determine the loop according to the dimensions of the matrix. For example, for scrolling in the column direction, the first row and the last row of the processing elements that store the elements of the matrix are connected; when scrolling up, the first row of elements of the matrix scrolls from its original storage location to the storage location of the last row of elements. For scrolling in the row direction, the first column and the last column of the processing elements that store the elements of the matrix are connected, and the first column of elements scrolls from its original storage location to the storage location of the last column of elements.
  • The above-mentioned connection between processing elements may refer to a virtual connection, that is, there is no actual connection line, but the controller records the corresponding processing elements so that a closed loop is formed during the scrolling process.
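  • The closed-loop behavior can be illustrated with two small cyclic-shift helpers. They are a sketch with hypothetical names, operating on nested lists rather than on real registers:

```python
def scroll_left(mat, steps=1):
    """Scroll every row left cyclically; elements leaving the first column re-enter at the last."""
    shifted = []
    for row in mat:
        s = steps % len(row)
        shifted.append(row[s:] + row[:s])
    return shifted


def scroll_up(mat, steps=1):
    """Scroll all rows up cyclically; the first row moves to the position of the last row."""
    s = steps % len(mat)
    return mat[s:] + mat[:s]


grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
left_once = scroll_left(grid)
up_once = scroll_up(grid)
```

  • Scrolling by as many steps as there are columns (or rows) returns every element to its original position, which is the property the repeated multiply-and-scroll process relies on.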
  • The preprocessing of the first matrix and the second matrix may also include a loading process.
  • The loading process may be performed before the first preprocessing and the second preprocessing, or after the first preprocessing and the second preprocessing.
  • That is to say, in the embodiments of the present disclosure, the first matrix and the second matrix can be loaded into the registers of the processing elements first, and then subjected to the first preprocessing and the second preprocessing to obtain the third matrix and the fourth matrix.
  • Alternatively, the first preprocessing and the second preprocessing of the first matrix and the second matrix can be completed outside the controller to obtain the third matrix and the fourth matrix, which are then loaded into the registers of the processing elements. The present disclosure does not limit this.
  • step S3-33 may include: summing the first element product matrix and the plurality of second element product matrices to obtain a fifth matrix, and processing the fifth matrix according to the manner of preprocessing the first matrix and the second matrix to obtain the matrix product.
  • the fifth matrix may be processed according to the process of the first preprocessing. For example, if in the first preprocessing elements 0 were added to the right and lower sides of the first matrix and the second matrix to form p × p matrices, the post-processing of the fifth matrix can be a reverse expansion on the right and lower sides of the fifth matrix, that is, the elements on the right and lower sides of the fifth matrix that correspond to the expansion are removed to form an m × n matrix.
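The expansion and reverse expansion can be sketched as follows. This is only an illustration under the assumption that matrices are held as lists of lists; the helper names `expand_to_square` and `remove_expansion` and the sample values are hypothetical.

```python
def expand_to_square(matrix, p):
    """Append zeros on the right and lower sides until the matrix is p x p."""
    cols = len(matrix[0])
    padded = [row + [0] * (p - cols) for row in matrix]
    padded += [[0] * p for _ in range(p - len(matrix))]
    return padded


def remove_expansion(matrix, m, n):
    """Reverse expansion: keep only the top-left m x n part."""
    return [row[:n] for row in matrix[:m]]


first = [[1, 2, 3], [4, 5, 6]]   # m = 2, k = 3
second = [[7], [8], [9]]         # k = 3, n = 1
p = max(len(first), len(first[0]), len(second[0]))  # p = max(m, k, n) = 3

print(expand_to_square(first, p))   # [[1, 2, 3], [4, 5, 6], [0, 0, 0]]
print(expand_to_square(second, p))  # [[7, 0, 0], [8, 0, 0], [9, 0, 0]]

# A p x p fifth matrix for these inputs, trimmed back to m x n = 2 x 1.
fifth = [[50, 0, 0], [122, 0, 0], [0, 0, 0]]
print(remove_expansion(fifth, 2, 1))  # [[50], [122]]
```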
  • the matrix multiplication operation requires neither splitting operations nor repeated reading of data, which reduces the number of memory reads, relieves bandwidth pressure, and yields high operation efficiency.
  • the input matrix can be transformed by preprocessing, and then the operation can be performed to obtain the result of the matrix multiplication.
  • the first matrix and the second matrix can be loaded into the registers of the processing elements, and then the first preprocessing process is performed: the first matrix and the second matrix are each expanded with elements 0 to a p × p matrix.
  • the elements of the first row and the first column of the first matrix and the second matrix can be loaded into the register of the same processing element during loading.
  • the first matrix may be loaded into the first set of registers Reg0 of the processing element
  • the second matrix may be loaded into the second set of registers Reg1 of the processing element.
  • each box in Reg0 can represent a register in a different processing element
  • each box in Reg1 can represent a register in a different processing element.
  • A 11 and B 11 are stored in the register of the same processing element.
  • the first group of registers or the second group of registers herein may refer to a layer of registers physically divided into different layers, or may be a group of registers divided logically, which is not limited in the present disclosure.
  • the controller can also connect the processing elements in the row direction or the column direction to form a closed loop. For example, in the column direction it can connect the processing elements that store the first row of elements and the last row of elements of the expanded first matrix and second matrix to form a ring; the data in the ring can flow, which realizes scrolling of the matrix in the column direction. Alternatively, in the row direction it can connect the processing elements that store the first column of elements and the last column of elements of the expanded first matrix and second matrix to form a ring; the data in the ring can flow, which realizes scrolling of the matrix in the row direction.
  • PE 11 and PE 31 may be connected to form a closed loop
  • PE 12 and PE 32 may be connected to form a closed loop
  • PE 13 and PE 33 may be connected to form a closed loop.
  • it is also possible to connect PE 11 and PE 13 to form a closed loop, connect PE 21 and PE 23 to form a closed loop, and connect PE 31 and PE 33 to form a closed loop.
  • if the data flows to the left, the data in the first column flows to the third column, the data in the second column flows to the first column, and the data in the third column flows to the second column; if the data flows to the right, the data in the first column flows to the second column, the data in the second column flows to the third column, and the data in the third column flows to the first column.
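The column-to-column flow on a three-column ring can be written as a small index mapping. The sketch below uses 1-based column numbers to match the text; the function name `destination_column` is an illustrative assumption.

```python
def destination_column(column, total_columns, direction):
    """Return the 1-based column that the data in `column` flows to on the ring."""
    if direction == "left":
        return (column - 2) % total_columns + 1
    if direction == "right":
        return column % total_columns + 1
    raise ValueError("direction must be 'left' or 'right'")


for direction in ("left", "right"):
    moves = {c: destination_column(c, 3, direction) for c in (1, 2, 3)}
    print(direction, moves)
# left {1: 3, 2: 1, 3: 2}
# right {1: 2, 2: 3, 3: 1}
```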
  • for the second preprocessing process: in an example (example 3-1), for matrix a 33, the controller does not need to scroll the 0th row, and controls the elements of the 1st row to scroll to the left by 1 step and the elements of the 2nd row to scroll to the left by 2 steps; the third matrix obtained is as follows:
  • the controller does not need to scroll the 0th column, and controls the elements in the 1st column to scroll up by 1 step, and the elements in the 2nd column scroll up by 2 steps to obtain the fourth matrix as follows:
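The second preprocessing of example 3-1 can be reproduced with a short sketch on symbolic 3 × 3 matrices. The element labels follow the A 11 / B 11 naming used in the text; the function names `skew_rows_left` and `skew_cols_up` are assumptions made for illustration, and the printed matrices are what this sketch computes rather than a copy of the original figures.

```python
def skew_rows_left(matrix):
    """Scroll row i to the left by i steps (rows counted from 0)."""
    p = len(matrix)
    return [[matrix[i][(j + i) % p] for j in range(p)] for i in range(p)]


def skew_cols_up(matrix):
    """Scroll column j up by j steps (columns counted from 0)."""
    p = len(matrix)
    return [[matrix[(i + j) % p][j] for j in range(p)] for i in range(p)]


A = [[f"A{i}{j}" for j in range(1, 4)] for i in range(1, 4)]
B = [[f"B{i}{j}" for j in range(1, 4)] for i in range(1, 4)]

third = skew_rows_left(A)
fourth = skew_cols_up(B)

for row in third:
    print(row)
# ['A11', 'A12', 'A13']
# ['A22', 'A23', 'A21']
# ['A33', 'A31', 'A32']
for row in fourth:
    print(row)
# ['B11', 'B22', 'B33']
# ['B21', 'B32', 'B13']
# ['B31', 'B12', 'B23']
```

At every position the column index of the A element equals the row index of the B element, which is what allows the element-wise products to be accumulated into a matrix product.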
  • for the second preprocessing process: in another example (example 3-2), for matrix a 33, the controller does not need to scroll the 0th row, controls the elements of the 1st row to scroll to the left by 1 step and the elements of the 2nd row to scroll to the left by 2 steps, and then controls all elements of the matrix to scroll to the right by 1 step (alternatively, the controller controls the 0th row to scroll to the right by 1 step, does not scroll the 1st row, and controls the 2nd row to scroll to the left by 1 step); the third matrix obtained is as follows:
  • the controller does not need to scroll the 0th column, controls the elements of the 1st column to scroll up by 1 step and the elements of the 2nd column to scroll up by 2 steps, and then scrolls the matrix down as a whole by 1 step; the fourth matrix obtained is as follows:
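Example 3-2 describes two equivalent ways of reaching the same layout: skew and then scroll back by one step, or skew directly by i-1 (j-1) steps. A minimal sketch that checks this equivalence on symbolic 3 × 3 matrices is shown below; the helper names and the `extra` parameter are illustrative assumptions.

```python
def skew_rows_left(matrix, extra=0):
    """Scroll row i to the left by i + extra steps (a negative total scrolls right)."""
    p = len(matrix)
    return [[matrix[i][(j + i + extra) % p] for j in range(p)] for i in range(p)]


def skew_cols_up(matrix, extra=0):
    """Scroll column j up by j + extra steps (a negative total scrolls down)."""
    p = len(matrix)
    return [[matrix[(i + j + extra) % p][j] for j in range(p)] for i in range(p)]


def scroll_right(matrix):
    return [row[-1:] + row[:-1] for row in matrix]


def scroll_down(matrix):
    return matrix[-1:] + matrix[:-1]


A = [[f"A{i}{j}" for j in range(1, 4)] for i in range(1, 4)]
B = [[f"B{i}{j}" for j in range(1, 4)] for i in range(1, 4)]

# Two-step form: left by i (up by j), then 1 step right (down) as a whole.
third_two_step = scroll_right(skew_rows_left(A))
fourth_two_step = scroll_down(skew_cols_up(B))

# One-step form: left by i-1 (up by j-1).
third = skew_rows_left(A, extra=-1)
fourth = skew_cols_up(B, extra=-1)

print(third == third_two_step, fourth == fourth_two_step)  # True True
print(third)   # [['A13', 'A11', 'A12'], ['A21', 'A22', 'A23'], ['A32', 'A33', 'A31']]
print(fourth)  # [['B31', 'B12', 'B23'], ['B11', 'B22', 'B33'], ['B21', 'B32', 'B13']]
```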
  • the third matrix and the fourth matrix may be loaded into the registers of the processing elements. It is sufficient to load the elements at corresponding positions of the third matrix and the fourth matrix into the register of the same processing element. There is no need to transpose the third matrix and the fourth matrix; that is, the third matrix and the fourth matrix are loaded into the registers of the processing elements in a row-column aligned manner.
  • the third matrix may be loaded into the first set of registers Reg0 of the processing element, and the fourth matrix may be loaded into the second set of registers Reg1 of the processing element.
  • each box in Reg0 can represent a register in a different processing element
  • each box in Reg1 can represent a register in a different processing element.
  • as shown in Figure 3-1, combined with example 3-1 described above, the storage location of the element A 11 and the element B 11 may be the register in the processing element PE 11, the storage location of the element A 12 and the element B 22 may be the register in the processing element PE 12, the storage location of the element A 21 and the element B 13 may be the register in the processing element PE 23...
  • the first group of registers or the second group of registers herein may refer to a layer of registers physically divided into different layers, or may be a group of registers divided logically, which is not limited in the present disclosure.
  • this embodiment is only an example of the present disclosure, and does not limit the present disclosure in any way, as long as the third matrix and the fourth matrix are loaded into the register of the processing element in a row-column aligned manner.
  • the controller controls the processing elements to multiply the elements in the corresponding registers to obtain the first element product matrix, which can be as follows:
  • for step S3-32, still taking example 3-1 as an example, scroll the third matrix one step to the left to get
  • the controller controls the processing elements to multiply the elements in the corresponding registers to obtain the second element product matrix, which can be as follows:
  • in this example p = 3 and p-1 = 2. Therefore, it is necessary to scroll the third matrix one more step to the left and the fourth matrix one more step upward.
  • the controller controls the processing elements to multiply the elements in the corresponding registers to obtain another second element product matrix
  • in step S3-33, the first element product matrix and the multiple second element product matrices are summed to obtain the fifth matrix
  • the first element product matrix and multiple second element product matrices calculated in the foregoing process may be temporarily stored in a temporary buffer.
  • the first element product matrix and the multiple second element product matrices can also be stored in the registers of the processing elements, for example in Reg2, Reg3 and Reg4 (other sets of registers of the processing elements), and each processing element can add the elements stored in its corresponding registers to realize the summing of the first element product matrix and the multiple second element product matrices. It should be noted that the above are only some examples of calculating the fifth matrix in the present disclosure, and do not limit the present disclosure in any way.
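The full sequence of expansion, second preprocessing, multiplication, repeated scrolling and summation can be sketched in software as follows. This is a minimal model that assumes list-of-lists matrices and a single running sum in place of separate registers such as Reg2 to Reg4; the function name `multiply_by_scrolling` is hypothetical and the code is not taken from the disclosure.

```python
def multiply_by_scrolling(first, second):
    """Multiply an m x k matrix by a k x n matrix via skew, scroll and accumulate."""
    m, k, n = len(first), len(first[0]), len(second[0])
    p = max(m, k, n)

    # First preprocessing: expand both matrices with zeros to p x p.
    a = [row + [0] * (p - k) for row in first] + [[0] * p for _ in range(p - m)]
    b = [row + [0] * (p - n) for row in second] + [[0] * p for _ in range(p - k)]

    # Second preprocessing: row i of a scrolls left by i, column j of b scrolls up by j.
    third = [[a[i][(j + i) % p] for j in range(p)] for i in range(p)]
    fourth = [[b[(i + j) % p][j] for j in range(p)] for i in range(p)]

    # First element product matrix; it also serves as the running sum.
    fifth = [[third[i][j] * fourth[i][j] for j in range(p)] for i in range(p)]

    for _ in range(p - 1):
        third = [row[1:] + row[:1] for row in third]  # whole matrix one step left
        fourth = fourth[1:] + fourth[:1]              # whole matrix one step up
        for i in range(p):
            for j in range(p):
                fifth[i][j] += third[i][j] * fourth[i][j]

    # Post-processing: remove the rows and columns added by the expansion.
    return [row[:n] for row in fifth[:m]]


print(multiply_by_scrolling([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
# [[58, 64], [139, 154]]
```

Each position (i, j) sees the pair a[i][t], b[t][j] for every t exactly once over the p multiplications, so the sum at that position is the usual matrix product entry.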
  • the calculation method of matrix multiplication according to the foregoing embodiments of the present disclosure is more suitable for a processor composed of processing elements arranged in an array, and the calculation efficiency is high. And for an input matrix of any scale that satisfies the arrangement of the processing elements, the input matrix can be transformed by preprocessing, and then the calculation can be performed to obtain the calculation result of the matrix multiplication. Moreover, compared with the matrix multiplication operation in the related technology, the number of memory accesses can be reduced, the bandwidth pressure can be reduced, and the efficiency of the operation can be improved.
  • the result of matrix multiplication can be directly obtained according to the above example.
  • the result of multiplying a first matrix by the corresponding second matrix according to the matrix multiplication rule is used as a first intermediate result. That is to say, each first matrix and second matrix obtained after block division can be treated as an element of a matrix when performing the operation process of matrix multiplication to obtain the first intermediate results, and the product of the input matrices can be calculated from the first intermediate results.
  • Figure 3-4 shows a schematic diagram of block division according to an embodiment of the present disclosure.
  • the matrices D and E are divided into blocks in the manner described above to obtain the first matrices D 11 , D 12 , D 21 , D 22 and the second matrices E 11 , E 12 , E 21 , E 22 .
  • each first matrix and each second matrix can be treated as an element of a matrix when performing the operation process of matrix multiplication.
  • each first intermediate result can be obtained by performing the calculation of step S3-31 to step S3-34 on the corresponding first matrix and second matrix.
  • the input matrix is divided into blocks, and the matrix multiplication operation of the present disclosure is performed on the divided matrix to obtain the first intermediate result, and the product of the input matrix can be calculated according to the first intermediate result.
  • the process of matrix multiplication can be quickly realized for any dimension of the matrix.
  • the number of memory accesses can be reduced, the bandwidth pressure is reduced, and the efficiency of calculations can be improved.
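The block-wise procedure can be sketched as follows. The sketch assumes square tiles whose size divides every matrix dimension, and it uses a plain triple-loop product as a stand-in for the per-block operation on the processing elements; all function names are illustrative assumptions.

```python
def naive_matmul(x, y):
    """Reference product, standing in for the per-block operation on the array."""
    return [[sum(x[i][t] * y[t][j] for t in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def split(matrix, tile):
    """Split a matrix into a grid of tile x tile blocks (dimensions assumed divisible)."""
    return [[[row[c:c + tile] for row in matrix[r:r + tile]]
             for c in range(0, len(matrix[0]), tile)]
            for r in range(0, len(matrix), tile)]


def add(x, y):
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def blocked_matmul(left, right, tile, block_product=naive_matmul):
    """Treat blocks as matrix elements and combine the block products."""
    left_blocks, right_blocks = split(left, tile), split(right, tile)
    result = []
    for block_row in left_blocks:
        out_blocks = []
        for bj in range(len(right_blocks[0])):
            acc = [[0] * tile for _ in range(tile)]
            for bk, left_block in enumerate(block_row):
                # Each block product plays the role of a first intermediate result.
                acc = add(acc, block_product(left_block, right_blocks[bk][bj]))
            out_blocks.append(acc)
        for r in range(tile):
            result.append([v for block in out_blocks for v in block[r]])
    return result


D = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
E = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
print(blocked_matmul(D, E, 2) == naive_matmul(D, E))  # True
```

Passing a different `block_product` (for example a scroll-based routine) changes how each block pair is multiplied without changing how the intermediate results are combined.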
  • the blocks a 11, a 12, b 11, b 21, b 12 and b 22 are each a given matrix (the matrix values are not reproduced here).
  • in step S3-31, since both the matrix a 11 and the matrix a 12 are 2 × 2 matrices, no expansion is required.
  • the second preprocessing process can be as follows: for matrix a 11 , the controller does not need to scroll the 0th row, and controls the elements of the 1st row to scroll to the left by 1 step; the third matrix obtained is as follows:
  • the controller does not need to scroll the 0th column, and controls the elements in the 1st column to scroll up by 1 step to obtain the fourth matrix as follows:
  • the elements at the corresponding positions of the third matrix and the fourth matrix are stored in the register of the same processing element.
  • the third matrix is stored in the first set of registers Reg0 of the processing element
  • the fourth matrix is stored in the second set of registers Reg1 of the processing element.
  • the storage location of element A 11 and element B 11 may refer to the register in processing element PE 11
  • the storage location of element A 12 and element B 22 may refer to the register in processing element PE 12
  • the storage location of element A 22 and element B 21 may refer to the register in the processing element PE 21.
  • the controller controls the processing elements to multiply the elements in the corresponding registers to obtain the first element product matrix, which can be as follows:
  • for step S3-32, still taking Example 3-1 as an example, scroll the third matrix one step to the left to get
  • the controller controls the processing elements to perform multiplication operations on the elements in the corresponding registers to obtain the second element product matrix.
  • the second element product matrix can be as follows:
  • the fifth matrix is obtained by summing the first element product matrix and the second element product matrix
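For a 2 × 2 case the whole procedure needs only one scroll step, which makes it easy to trace symbolically. The sketch below builds the products as strings so that each term stays visible; the element labels follow the A 11 / B 11 naming in the text, and the helper names are assumptions for illustration.

```python
def skew(a, b):
    """Second preprocessing: row i of a left by i, column j of b up by j."""
    p = len(a)
    third = [[a[i][(j + i) % p] for j in range(p)] for i in range(p)]
    fourth = [[b[(i + j) % p][j] for j in range(p)] for i in range(p)]
    return third, fourth


def products(third, fourth):
    """Element-wise products, written out symbolically."""
    return [[f"{x}*{y}" for x, y in zip(r3, r4)] for r3, r4 in zip(third, fourth)]


A = [["A11", "A12"], ["A21", "A22"]]
B = [["B11", "B12"], ["B21", "B22"]]

third, fourth = skew(A, B)
first_product = products(third, fourth)

third = [row[1:] + row[:1] for row in third]  # scroll one step to the left
fourth = fourth[1:] + fourth[:1]              # scroll one step up
second_product = products(third, fourth)

fifth = [[f"{x} + {y}" for x, y in zip(r1, r2)]
         for r1, r2 in zip(first_product, second_product)]
for row in fifth:
    print(row)
# ['A11*B11 + A12*B21', 'A12*B22 + A11*B12']
# ['A22*B21 + A21*B11', 'A21*B12 + A22*B22']
```

Up to the order of the terms, each entry is the corresponding entry of the ordinary 2 × 2 matrix product.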
  • step S3-31 to step S3-33 can be used to obtain the first intermediate results, and then the product of the input matrices can be calculated from the first intermediate results.
  • the calculation process is:
  • C 12 = a 11 × b 12 + a 12 × b 22 .
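For reference, the block-level rule can be written out as a math block. It assumes, based on the blocks named in this example, that the left multiplication matrix is divided into a 11 and a 12 and the right multiplication matrix into b 11, b 12, b 21 and b 22; the expression for C 11 is not stated in the text and simply follows from the same rule of matrix multiplication.

```latex
\begin{aligned}
C_{11} &= a_{11} \times b_{11} + a_{12} \times b_{21},\\
C_{12} &= a_{11} \times b_{12} + a_{12} \times b_{22}.
\end{aligned}
```

Here each product such as a 11 × b 12 is itself a matrix product of two blocks, that is, a first intermediate result.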
  • the above is the calculation method of matrix multiplication according to various embodiments of the present disclosure. According to the above process, the product of the input matrix can be calculated in a block manner. Therefore, the matrix multiplication operation method according to the present disclosure can realize matrix operations of any size.
  • the present disclosure also provides a processor.
  • Figure 3-1 shows an example of a processor.
  • the processor may include two or more processing elements arranged in a two-dimensional matrix, and each processing element includes at least one register. The processor is used to perform a matrix multiplication operation on the first matrix and the second matrix.
  • the processor also includes a controller for preprocessing the first matrix and the second matrix to obtain the third matrix and the fourth matrix, wherein the elements at the corresponding positions of the third matrix and the fourth matrix are stored in the register of the same processing element, m represents the row rank of the first matrix, n represents the column rank of the second matrix, the column rank of the first matrix and the row rank of the second matrix are both k, and p is the maximum of m, k, and n;
  • the controller is used to scroll the third matrix and the fourth matrix in the row direction or the column direction, and control the processing element to perform multiplication operations on the elements in the corresponding registers to obtain the element product matrix;
  • the controller is used for processing the element product matrix according to the preprocessing method of the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • the controller is further configured to control the processing element to multiply the elements in the corresponding register to obtain the first element product matrix
  • the controller repeats the following process p-1 times: scroll the third matrix as a whole to the left once and the fourth matrix as a whole up once, or scroll the third matrix as a whole to the right once and the fourth matrix as a whole down once, and control the processing elements to multiply the elements in the corresponding registers to obtain the second element product matrix.
  • the controller is configured to sum the first element product matrix and the second element product matrix to obtain a fifth matrix, and to process the fifth matrix according to the manner of preprocessing the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • the preprocessing of the first matrix and the second matrix by the controller includes: a first preprocessing and a second preprocessing
  • the first preprocessing refers to: using 0 to expand the right side and/or the lower side of the first matrix and the second matrix to obtain a p × p matrix;
  • the second preprocessing refers to: scrolling the elements in the expanded p × p matrix.
  • for the manner of scrolling the third matrix as a whole to the left and the fourth matrix as a whole upward, the corresponding second preprocessing process is: the i-th row of the expanded first matrix is scrolled to the left by i steps, and the j-th column of the expanded second matrix is scrolled up by j steps, where i and j are natural numbers, 0 ≤ i ≤ p-1, and 0 ≤ j ≤ p-1.
  • for the manner of scrolling the third matrix as a whole to the right and the fourth matrix as a whole downward, the corresponding second preprocessing process is: the i-th row of the expanded first matrix is scrolled to the left by i-1 steps, and the j-th column of the expanded second matrix is scrolled up by j-1 steps.
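The right/down scrolling manner with its i-1 / j-1 preprocessing can be checked with a short sketch. It assumes square p × p inputs (that is, the first preprocessing has already been applied) and list-of-lists storage; the function name is hypothetical.

```python
def multiply_scrolling_right_down(a, b):
    """Multiply two p x p matrices using the right/down scrolling manner."""
    p = len(a)

    # Second preprocessing: row i left by i-1 steps, column j up by j-1 steps.
    third = [[a[i][(j + i - 1) % p] for j in range(p)] for i in range(p)]
    fourth = [[b[(i + j - 1) % p][j] for j in range(p)] for i in range(p)]

    total = [[third[i][j] * fourth[i][j] for j in range(p)] for i in range(p)]
    for _ in range(p - 1):
        third = [row[-1:] + row[:-1] for row in third]  # whole matrix one step right
        fourth = fourth[-1:] + fourth[:-1]              # whole matrix one step down
        for i in range(p):
            for j in range(p):
                total[i][j] += third[i][j] * fourth[i][j]
    return total


print(multiply_scrolling_right_down([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

For i = 0 the step count i-1 is negative, which the modulo arithmetic treats as one step to the right, matching the alternative wording of example 3-2.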
  • the controller is further configured to determine whether to block the input matrix according to the arrangement of the processing elements and the row rank and column rank of the input matrix, where the input matrix includes a left multiplication matrix and a right multiplication matrix;
  • the controller splits the rows of the left multiplication matrix according to the arrangement of the processing elements.
  • the controller splits the columns of the right multiplication matrix according to the arrangement of the processing elements.
  • the controller blocks the column direction of the left multiplication matrix and the row direction of the right multiplication matrix in the same way according to the arrangement of the processing elements and the row rank and column rank of the input matrix.
  • after the left multiplication matrix is divided into blocks, two or more first matrices are obtained, and after the right multiplication matrix is divided into blocks, two or more second matrices are obtained; or, after the left multiplication matrix is divided into blocks, two or more second matrices are obtained, and after the right multiplication matrix is divided into blocks, two or more first matrices are obtained.
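  • if the number of columns of the left multiplication matrix is not greater than the number of columns of the processing elements and the number of rows of the right multiplication matrix is not greater than the number of rows of the processing elements, then: if the number of rows of the left multiplication matrix is greater than the number of rows of the processing elements, the controller determines to block the left multiplication matrix; and if the number of columns of the right multiplication matrix is greater than the number of columns of the processing elements, the controller determines to block the right multiplication matrix;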
  • the controller blocks both matrices in the input matrix.
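The blocking decision can be sketched as a small function. The row and column comparisons follow the conditions stated for the controller elsewhere in this disclosure (see Clause C17); the condition under which both matrices are blocked is not spelled out in the text, so the sketch assumes it applies whenever the shared dimension does not fit the array. The function name `blocking_plan` and the tuple-based shapes are illustrative assumptions.

```python
def blocking_plan(left_shape, right_shape, pe_shape):
    """Decide which input matrices need block division for a given PE array."""
    m, k = left_shape
    k2, n = right_shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    pe_rows, pe_cols = pe_shape

    if k > pe_cols or k > pe_rows:
        # Assumption: the shared dimension does not fit, so both are blocked.
        return {"left": True, "right": True}

    # Shared dimension fits: block only the side that exceeds the array.
    return {"left": m > pe_rows, "right": n > pe_cols}


print(blocking_plan((6, 3), (3, 3), (4, 4)))  # {'left': True, 'right': False}
print(blocking_plan((2, 3), (3, 9), (4, 4)))  # {'left': False, 'right': True}
print(blocking_plan((2, 8), (8, 2), (4, 4)))  # {'left': True, 'right': True}
```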
  • the controller is further configured to calculate, according to the rule of matrix multiplication, the product of the left multiplication matrix and the right multiplication matrix from the products of the first matrices and the second matrices.
  • the embodiment of the present disclosure also provides an artificial intelligence chip, which includes the processor as described above.
  • the embodiment of the present disclosure also provides an arithmetic device including the above-mentioned processor.
  • a board card which includes a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip; wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to realize data transmission between the artificial intelligence chip and an external device; and the control device is used to monitor the state of the artificial intelligence chip.
  • the element product matrix is processed according to the manner of preprocessing the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • Clause C2 The method according to Clause C1, wherein scrolling the third matrix and the fourth matrix in the row direction or the column direction and controlling the processing elements to multiply the elements in the corresponding registers to obtain the element product matrix includes:
  • controlling the processing elements to perform multiplication operations on the elements in the corresponding registers to obtain the first element product matrix
  • processing the element product matrix according to the method of preprocessing the first matrix and the second matrix to obtain the product of the first matrix and the second matrix includes:
  • the fifth matrix is obtained by summing the first element product matrix and the second element product matrix, and the fifth matrix is processed according to the manner of preprocessing the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • Clause C4 The method according to clause C1, wherein the preprocessing of the first matrix and the second matrix to obtain the third matrix and the fourth matrix includes a first preprocessing and a second preprocessing,
  • the first preprocessing refers to: using 0 to expand the right side and/or the lower side of the first matrix and the second matrix to obtain a p × p matrix;
  • the second preprocessing refers to: scrolling the elements in the expanded p × p matrix.
  • the corresponding second preprocessing process is: the i-th row of the expanded first matrix is scrolled to the left by i steps, and the j-th column of the expanded second matrix is scrolled up by j steps, where i and j are natural numbers, 0 ≤ i ≤ p-1, and 0 ≤ j ≤ p-1.
  • the corresponding second preprocessing process is: the i-th row of the expanded first matrix is scrolled to the left by i-1 steps, and the j-th column of the expanded second matrix is scrolled up by j-1 steps.
  • after the left multiplication matrix is divided into blocks, two or more first matrices are obtained, and after the right multiplication matrix is divided into blocks, two or more second matrices are obtained; or, after the left multiplication matrix is divided into blocks, two or more second matrices are obtained, and after the right multiplication matrix is divided into blocks, two or more first matrices are obtained.
  • Clause C8 The method according to Clause C7, wherein determining whether to block the input matrix according to the arrangement of the processing elements and the row rank and column rank of the input matrix includes:
  • if the number of rows of the left multiplication matrix is greater than the number of rows of the processing elements, determining to block the left multiplication matrix; and if the number of columns of the right multiplication matrix is greater than the number of columns of the processing elements, determining to block the right multiplication matrix;
  • Clause C9 The method according to clause C7, the method further comprising: calculating the product of the left multiplication matrix and the right multiplication matrix according to the product of the first matrix and the second matrix according to the rule of matrix multiplication.
  • a processor includes two or more processing elements arranged in a two-dimensional matrix, each processing element includes at least one register, and the processor is configured to perform a matrix multiplication operation on a first matrix and a second matrix. The processor further includes a controller for preprocessing the first matrix and the second matrix to obtain the third matrix and the fourth matrix, wherein the elements at the corresponding positions of the third matrix and the fourth matrix are stored in the register of the same processing element,
  • m represents the row rank of the first matrix, n represents the column rank of the second matrix, the column rank of the first matrix and the row rank of the second matrix are both k, and p is the maximum of m, k, and n;
  • the controller is used to scroll the third matrix and the fourth matrix in the row direction or the column direction, and control the processing element to perform multiplication operations on the elements in the corresponding registers to obtain the element product matrix;
  • the controller is used for processing the element product matrix according to the preprocessing method of the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • the controller repeats the following p-1 times: scrolling the third matrix as a whole to the left once and the fourth matrix as a whole up once, or scrolling the third matrix as a whole to the right once and the fourth matrix as a whole down once,
  • and controlling the processing elements to multiply the elements in the corresponding registers to obtain a second element product matrix.
  • Clause C12 The processor according to Clause C11, wherein the controller is configured to sum the first element product matrix and the second element product matrix to obtain a fifth matrix, and to process the fifth matrix according to the manner of preprocessing the first matrix and the second matrix to obtain the product of the first matrix and the second matrix.
  • Clause C13 The processor according to clause C10, wherein the pre-processing of the first matrix and the second matrix by the controller includes: a first pre-processing and a second pre-processing,
  • the first preprocessing refers to: using 0 to expand the right side and/or the lower side of the first matrix and the second matrix to obtain a p × p matrix;
  • the second preprocessing refers to: scrolling the elements in the expanded p × p matrix.
  • Clause C14 The processor according to Clause C13, wherein for the manner of scrolling the third matrix as a whole to the left and the fourth matrix as a whole upward, the corresponding second preprocessing process is: the i-th row of the expanded first matrix is scrolled to the left by i steps, and the j-th column of the expanded second matrix is scrolled up by j steps, where i and j are natural numbers, 0 ≤ i ≤ p-1, and 0 ≤ j ≤ p-1.
  • Clause C15 The processor according to Clause C13, wherein for the manner of scrolling the third matrix as a whole to the right and the fourth matrix as a whole downward, the corresponding second preprocessing process is: the i-th row of the expanded first matrix is scrolled to the left by i-1 steps, and the j-th column of the expanded second matrix is scrolled up by j-1 steps.
  • the controller is further configured to determine whether to block the input matrix according to the arrangement of the processing elements and the row rank and column rank of the input matrix, where the input matrix includes a left multiplication matrix and a right multiplication matrix;
  • the controller splits the rows of the left multiplication matrix according to the arrangement of the processing elements.
  • the controller splits the columns of the right multiplication matrix according to the arrangement of the processing elements.
  • the controller blocks the column direction of the left multiplication matrix and the row direction of the right multiplication matrix in the same way according to the arrangement of the processing elements and the row rank and column rank of the input matrix.
  • after the left multiplication matrix is divided into blocks, two or more first matrices are obtained, and after the right multiplication matrix is divided into blocks, two or more second matrices are obtained; or, after the left multiplication matrix is divided into blocks, two or more second matrices are obtained, and after the right multiplication matrix is divided into blocks, two or more first matrices are obtained.
  • Clause C17 The processor according to Clause C16, wherein if the number of columns of the left multiplication matrix is not greater than the number of columns of the processing elements and the number of rows of the right multiplication matrix is not greater than the number of rows of the processing elements, then: if the number of rows of the left multiplication matrix is greater than the number of rows of the processing elements, the controller determines to block the left multiplication matrix; and if the number of columns of the right multiplication matrix is greater than the number of columns of the processing elements, the controller determines to block the right multiplication matrix;
  • the controller blocks both matrices in the input matrix.
  • Clause C18 The processor according to clause C16, wherein the controller is further configured to calculate, according to the rule of matrix multiplication, the product of the left multiplication matrix and the right multiplication matrix from the products of the first matrices and the second matrices.
  • Fig. 4 shows a block diagram of a board according to an embodiment of the present disclosure.
  • the board may include other supporting components in addition to the chip 189 described above. The supporting components include but are not limited to: a storage device 190, an interface device 191, and a control device 192;
  • the storage device 190 is connected to the artificial intelligence chip through a bus for storing data.
  • the storage device may include multiple groups of storage units 193. Each group of the storage unit and the artificial intelligence chip are connected through a bus. It can be understood that each group of the storage units may be DDR SDRAM (English: Double Data Rate SDRAM, double-rate synchronous dynamic random access memory).
  • the storage device may include 4 groups of the storage units. Each group of the storage unit may include a plurality of DDR4 particles (chips).
  • the artificial intelligence chip may include four 72-bit DDR4 controllers. In the 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC verification.
  • each group of the storage unit includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transmit data twice in one clock cycle.
  • a controller for controlling the DDR is provided in the chip for controlling the data transmission and data storage of each storage unit.
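The controller figures above lend themselves to a small piece of arithmetic. The sketch below uses the 64 data bits, 8 ECC bits and four controllers stated in the text, together with the two transfers per clock cycle of DDR; the 1600 MHz clock is a hypothetical value chosen only for illustration and is not stated in this section.

```python
DATA_BITS = 64      # bits per transfer used for data (from the text)
ECC_BITS = 8        # bits per transfer used for ECC verification (from the text)
CONTROLLERS = 4     # number of 72-bit DDR4 controllers (from the text)


def peak_bandwidth_mb_per_s(clock_mhz):
    """Peak data bandwidth of one controller in MB/s for a given clock in MHz."""
    transfers_per_us = 2 * clock_mhz          # DDR: two transfers per clock cycle
    return transfers_per_us * DATA_BITS // 8  # bytes per microsecond equals MB/s


assumed_clock_mhz = 1600  # hypothetical clock, not taken from the disclosure
per_controller = peak_bandwidth_mb_per_s(assumed_clock_mhz)
print(DATA_BITS + ECC_BITS)           # 72
print(per_controller)                 # 25600
print(per_controller * CONTROLLERS)   # 102400
```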
  • the interface device is electrically connected with the artificial intelligence chip.
  • the interface device is used to implement data transmission between the artificial intelligence chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be other interfaces.
  • the present disclosure does not limit the specific form of the above other interfaces, as long as the interface unit can realize the transfer function.
  • the calculation result of the artificial intelligence chip is still transmitted by the interface device back to an external device (such as a server).
  • the control device is electrically connected with the artificial intelligence chip.
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the artificial intelligence chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, MCU).
  • the artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the artificial intelligence chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the artificial intelligence chip.
  • the embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored, and the computer program instructions implement the above-mentioned method when executed by a processor.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the embodiment of the present disclosure also provides an electronic device including the above-mentioned processor.
  • the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated unit/module can be realized in the form of hardware or software program module.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • if the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • the technical solution of the present disclosure, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.
  • the present disclosure may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for enabling a processor to implement various aspects of the present disclosure.
  • the computer-readable storage medium may be a tangible device that can hold and store instructions used by the instruction execution device.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination of the foregoing.
  • the computer-readable storage medium used here is not to be interpreted as a transient signal itself, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber-optic cables), or electrical signals transmitted through wires.
  • the computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to various computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • the network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network, and forwards the computer-readable program instructions for storage in the computer-readable storage medium in each computing/processing device.
  • the computer program instructions used to perform the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions can be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can be customized by using the state information of the computer-readable program instructions.
  • the electronic circuit can execute the computer-readable program instructions to realize various aspects of the present disclosure.
  • These computer-readable program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing device, produce a device that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; they cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions constitutes an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of an instruction, which contains one or more executable instructions for realizing the specified logical function.
  • In some alternative implementations, the functions noted in the blocks may occur in a different order from the order marked in the drawings. For example, two consecutive blocks can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Abstract

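An operation method, a processor, and related products. The product includes a storage device (390), an interface device (391), a control device (392), and an artificial intelligence chip (389). The artificial intelligence chip (389) is connected to the storage device (390), the control device (392), and the interface device (391), respectively. The storage device (390) is used to store data; the interface device (391) is used to implement data transmission between the artificial intelligence chip (389) and an external device; the control device (392) is used to monitor the state of the artificial intelligence chip (389). The above operation method and related products can improve operation efficiency when the related products perform matrix multiplication.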

Description

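Operation method, processor, and related products
Technical Field
The present disclosure relates to the field of information processing technology, and in particular to an operation method, a processor, and related products.
Background
In the field of artificial intelligence, neural network algorithms are a recently very popular class of machine learning algorithms and have achieved very good results in many areas, such as image recognition, speech recognition, and natural language processing. As neural network algorithms have developed, their complexity has grown, and model sizes have gradually increased to improve recognition accuracy. Processing these large-scale models on GPUs and CPUs takes a great deal of computing time and consumes a lot of power.
Summary
In view of the above technical problems, it is necessary to provide an operation method, a processor, and related products that can improve operation efficiency.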
根据本公开的一方面,提供了一种基于处理元件矩阵的矩阵乘的运算方法,应用于处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述方法实现对第一矩阵和第二矩阵的矩阵乘法运算,
所述方法包括:
将第一矩阵加载到处理元件的寄存器中,第一矩阵中的元素在矩阵中的排列方式和在处理元件的寄存器中的排列方式相同;
针对第二矩阵的每一行,将所述每一行中的元素与第一矩阵的每一列元素对应存储到处理元件的寄存器,与第一矩阵的每一列中的元素分别求乘积,计算一列乘积的和得到第一中间结果;或者,针对第二矩阵的每一列,将所述每一列中的元素与第一矩阵的每一行元素对应存储到处理元件的寄存器,与第一矩阵的每一行中的元素分别求乘积,计算一行乘积的和得到第一中间结果;
将第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
根据本公开的另一方面,提供了一种处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述处理器用于对第一矩阵和第二矩阵执行矩阵乘法运算,
所述处理器还包括控制器,所述控制器用于将第一矩阵加载到处理元件的寄存器中;
针对第二矩阵的每一行,所述控制器用于将所述每一行中的元素与第一矩阵的每一列元素对应存储到处理元件的寄存器,与第一矩阵的每一列中的元素分别求乘积,计算一列乘积的和得到第一中间结果;或者,针对第二矩阵的每一列,所述控制器用于将所述每一列中的元素与第一矩阵的每一行元素对应存储到处理元件的寄存器,与第一矩阵的每一行中的元素分别求乘积,计算一行乘积的和得到第一中间结果;
所述控制器还用于将第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
根据本公开的另一方面,提供了一种人工智能芯片,所述芯片包括如上所述的处理器。
根据本公开的另一方面,提供了一种电子设备,包括如上所述的人工智能芯片。
根据本公开的另一方面,提供了一种电子设备,包括如上所述的处理器。
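The matrix multiplication operation method and processor according to the above embodiments of the present disclosure are better suited to processors composed of processing elements arranged in an array, and offer high operation efficiency. Moreover, for an input matrix of any size that fits the arrangement of the processing elements, the result of the matrix multiplication can be obtained, the number of memory accesses can be reduced, bandwidth pressure can be lowered, and operation efficiency can be improved.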
根据本公开的第一方面,提供了一种处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述处理器用于对第一矩阵和第二矩阵执行矩阵乘法运算,
所述处理器还包括控制器,所述控制器用于将第一矩阵的转置矩阵和第二矩阵的各元素分别加载到各处理元件的寄存器中,所述转置矩阵和所述第二矩阵对应位置的元素存储在同一处理元件的寄存器中;
所述控制器用于控制所述转置矩阵或者第二矩阵在行方向或者列方向滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积、将同一行或同一列的元素乘积求和得到第一中间结果;
所述控制器还用于对所述第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
根据本公开的第二方面,提供了一种基于处理元件矩阵的矩阵乘的运算方法,应用于处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述方法实现对第一矩阵和第二矩阵的矩阵乘法运算,所述方法包括:
将第一矩阵进行转置得到转置矩阵,将转置矩阵和第二矩阵的各元素分别加载到各处理元件的寄存器中,转置矩阵和第二矩阵对应位置的元素存储在同一处理元件的寄存器中;
控制所述转置矩阵或者第二矩阵在行方向或者列方向滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积、将同一行或同一列的元素乘积求和得到第一中间结果;
对所述第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
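The transpose-and-roll steps above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the disclosed hardware: it assumes square n×n inputs, rolls the transposed matrix in the row direction, and sums the element products of each PE column. The function name `transpose_roll_matmul` and the way each column sum is placed into the product are hypothetical choices made for illustration.

```python
def transpose_roll_matmul(first, second):
    """Multiply two square matrices by modelling transpose, roll, multiply, column-sum."""
    n = len(first)
    # transposed[k][i] == first[i][k]; it shares PE(k, j) with second[k][j]
    transposed = [[first[i][k] for i in range(n)] for k in range(n)]
    product = [[0] * n for _ in range(n)]
    for shift in range(n):
        # After rolling the transposed matrix left by `shift` positions,
        # PE(k, j) holds transposed[k][(j + shift) % n] and second[k][j].
        for j in range(n):
            i = (j + shift) % n
            # Summing the element products down PE column j gives product[i][j]
            # (one first intermediate result per column in this model).
            product[i][j] = sum(transposed[k][i] * second[k][j] for k in range(n))
    return product


result = transpose_roll_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

With `shift = 0` the column sums are the diagonal of the product; each further roll yields another wrapped diagonal, so n rolls cover every element.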
根据本公开的第三方面,提供了一种人工智能芯片,所述芯片包括如上所述的处理器。
根据本公开的第四方面,提供了一种电子设备,包括如上所述的人工智能芯片。
根据本公开上述各实施方式的矩阵乘的运算方法、处理器等产品,对于满足处理元件的排列的任意规模的输入矩阵,都可以得到矩阵乘法的运算结果,并且相比于相关技术中的矩阵乘运算可以减少访存次数,降低带宽压力,提高运算的效率。
根据本公开的一方面,提供了一种基于处理元件矩阵的矩阵乘的运算方法,应用于处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述方法实现对第一矩阵和第二矩阵的矩阵乘法运算,所述方法包括:
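Preprocessing the first matrix and the second matrix to obtain a third matrix and a fourth matrix, where the third matrix and the fourth matrix are both p×p matrices and p = max(m, k, n), that is, p is the largest of m, k, and n. Here m is the number of rows of the first matrix (its "row rank" in the wording of this disclosure), n is the number of columns of the second matrix (its "column rank"), and k is both the number of columns of the first matrix and the number of rows of the second matrix;
Loading the third matrix and the fourth matrix into the registers of the processing elements with their rows and columns aligned, so that after loading, elements at corresponding positions of the third matrix and the fourth matrix are stored in the registers of the same processing element;
Rolling the third matrix and the fourth matrix in the row direction or the column direction, and controlling the processing elements to multiply the elements in their respective registers to obtain element product matrices;
Processing the element product matrices according to the way the first matrix and the second matrix were preprocessed, to obtain the product of the first matrix and the second matrix.
In a possible implementation, rolling the third matrix and the fourth matrix in the row direction or the column direction and controlling the processing elements to multiply the elements in their respective registers to obtain element product matrices includes:
Controlling the processing elements to multiply the elements in their respective registers to obtain a first element product matrix;
Rolling the whole third matrix left once and the whole fourth matrix up once, or rolling the whole third matrix right once and the whole fourth matrix down once, and controlling the processing elements to multiply the elements in their respective registers to obtain a second element product matrix; this is repeated p-1 times, giving p-1 second element product matrices.
In a possible implementation, processing the element product matrices according to the way the first matrix and the second matrix were preprocessed to obtain the product of the first matrix and the second matrix includes:
Summing the first element product matrix and the second element product matrices to obtain a fifth matrix, and processing the fifth matrix according to the way the first matrix and the second matrix were preprocessed, to obtain the product of the first matrix and the second matrix.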
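The rolling scheme above leaves the preprocessing step unspecified in this passage. The sketch below fills it with an assumption: zero-padding both inputs to p×p and applying a Cannon-style initial skew (row i of the third matrix rolled left by i, column j of the fourth matrix rolled up by j). The post-processing is then simply cropping the fifth matrix back to m×n. The names `pad` and `rolled_matmul` are hypothetical, and the disclosure may use a different preprocessing.

```python
def pad(matrix, p):
    """Zero-pad a matrix to p x p."""
    rows, cols = len(matrix), len(matrix[0])
    return [[matrix[i][j] if i < rows and j < cols else 0 for j in range(p)]
            for i in range(p)]


def rolled_matmul(first, second):
    m, k, n = len(first), len(second), len(second[0])
    p = max(m, k, n)
    # Assumed preprocessing: pad, then skew so that PE(i, j) starts with
    # first[i][(i + j) % p] and second[(i + j) % p][j].
    third = pad(first, p)
    fourth = pad(second, p)
    third = [[third[i][(i + j) % p] for j in range(p)] for i in range(p)]
    fourth = [[fourth[(i + j) % p][j] for j in range(p)] for i in range(p)]

    # First element product matrix.
    fifth = [[third[i][j] * fourth[i][j] for j in range(p)] for i in range(p)]
    for _ in range(p - 1):
        third = [row[1:] + row[:1] for row in third]   # roll whole matrix left once
        fourth = fourth[1:] + fourth[:1]               # roll whole matrix up once
        for i in range(p):
            for j in range(p):
                fifth[i][j] += third[i][j] * fourth[i][j]  # add second element products

    # Assumed post-processing: crop the padding away.
    return [row[:n] for row in fifth[:m]]


square = rolled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
rectangular = rolled_matmul([[1, 2, 3], [4, 5, 6]], [[1], [2], [3]])
```

No element is re-read from memory between steps in this model; only neighbour-to-neighbour rolls happen, which is the property the text credits for reducing memory accesses.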
根据本公开的另一方面,提供了一种处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述处理器用于对第一矩阵和第二矩阵的矩阵乘法运算,所述处理器还包括控制器,所述控制器用于对第一矩阵和第二矩阵进行预处理得到第三矩阵和第四矩阵,其中,第三矩阵和第四矩阵对应位置的元素存储在同一处理元件的寄存器中,第三矩阵和第四矩阵都为p×p矩阵,p=max(m,k,n),m表示第一矩阵的行秩,n表示第二矩阵的列秩,第一矩阵的列秩和第二矩阵的行秩为k,p为m、k、n三者中的最大值;
所述控制器用于对第三矩阵和第四矩阵在行方向或者列方向进行滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积矩阵;
所述控制器用于根据对第一矩阵和第二矩阵预处理的方式对元素乘积矩阵进行处理得到第一矩阵和第二矩阵的乘积。
根据本公开的另一方面,提供了一种基于处理元件矩阵的矩阵乘的运算装置,包括:上述处理器。
根据本公开的另一方面,提供了一种非易失性计算机可读存储介质,其上存储有计算机程序指令,其中,所述计算机程序指令被处理器执行时实现上述方法。
根据本公开的另一方面,提供了一种人工智能芯片,所述芯片包括如上所述的处理器。
根据本公开的另一方面,提供了一种电子设备,包括如上所述的人工智能芯片。
根据本公开上述各实施方式的矩阵乘的运算方法、处理器及相关产品,进行矩阵乘法运算时不需要反复读取数据,减少读取内存的次数,降低带宽压力,运算效率高。且对于任意规模的输入矩阵,都可以通过预处理的方式对输入矩阵进行变换,然后进行运算,可以得到矩阵乘法的运算结果。
根据下面参考附图对示例性实施例的详细说明,本公开的其它特征及方面将变得清楚。
附图说明
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本公开的示例性实施例、特征和方面,并且用于解释本公开的原理。
图1-1示出根据本公开一实施例的处理器的示意图。
图1-2a和图1-2b分别示出了不同的划分方式的示例。
图1-3示出根据本公开一实施例的运算方法的流程图。
图1-4示出根据本公开一实施例的处理元件组成的阵列的示意图。
图1-5示出根据本公开一实施例的分块的示意图。
图1-6示出根据本公开一实施例的对矩阵划分的示例。
图2-1示出根据本公开一实施例的处理器的示意图。
图2-2a和图2-2b分别示出了多种不同的划分方式的示例。
图2-3示出根据本公开一实施例的运算方法的流程图。
图2-4示出根据本公开一实施例的处理元件组成的阵列的示意图。
图2-5示出根据本公开一实施例的分块的示意图。
图2-6示出根据本公开一实施例的对矩阵划分的示例。
图3-1示出根据本公开一实施例的处理器的示意图。
图3-2a和图3-2b分别示出了不同的划分矩阵的方式的示例。
图3-3示出根据本公开一实施例的运算方法的流程图。
图3-4示出根据本公开一实施例的分块的示意图。
图4示出根据本公开实施例的板卡的结构框图。
具体实施方式
下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本公开一部分实施例,而不是全部的实施例。基于本公开中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本公开保护的范围。
应当理解,本公开的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本公开的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本公开说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本公开。如在本公开说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本公开说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何处理以及所有可能处理,并且包括这些处理。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
矩阵运算在利用人工智能对信息进行处理的过程中占据比较大的计算量,并且现有的处理器在处理矩阵运算的过程中把矩阵运算拆解成乘法运算和加法运算,需要频繁的从内存中读取数据,运算的效率很低。
相关技术中,对于输入矩阵规模比较大的矩阵乘法,为了提高矩阵运算的效率,通常采用多级流水线的方式实现运算的过程,但多级流水线由于每一级对输入数据中的一部分进行处理,因此,需要频繁的从内存中读取数据,频繁访问内存导致对带宽的要求较高。
为了解决上述技术问题,本公开提供了一种运算方法以及执行该运算方法的处理器。处理器可以包括多个处理元件,在一些实施方式中,多个处理元件可以以二维矩阵的形式排列以更好的适应矩阵运算,每个处理元件可以包括至少一个寄存器。
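Fig. 1-1 shows a schematic diagram of a processor according to an embodiment of the present disclosure. As shown in Fig. 1-1, multiple processing elements (PEs) are arranged in a two-dimensional matrix, each processing element is connected to its adjacent processing elements, and each PE may be provided with at least one register (not shown in the figure). The processor may also include a controller and a memory, both of which are connected to the processing elements, and the controller may be connected to the memory. The controller is used to load data from the memory into the registers of the processing elements and to control the processing elements to process the input data.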
在本公开的实施例的运算过程中,控制器可以先将一个矩阵的元素加载到各个PE对应的寄存器中,然后将另一个矩阵的元素按行或者按列或者按照元素遍历的方式根据加载到寄存器的矩阵中的元素加载的位置存储至对应的寄存器中,然后控制每个PE对PE内设置的寄存器存储的元素进行运算。
在一种可能的实现方式中,存储器中还可以存储有可执行程序,可执行程序中可以包括指令,处理器执行指令可以实现矩阵乘法运算。控制器中可以设置有加载器、译码器等,其中,加载器可以用于将存储器中的输入数据加载到处理元件的寄存器中,译码器可以根据加载后输入数据的存储地址的变化对可执行程序中访问数据的指令进行译码,比如说,对于访问数据的指令,通过译码获得数据在寄存器中存储的地址赋值给访问数据的指令,并将译码后的指令发送给处理元件,由处理元件执行指令,从而实现对数据的处理。
在一种可能的实现方式中,存储器可以为片上缓存,控制器可以将片外闪存上的可执行程序以及输入数据(例如,输入矩阵,包括左乘矩阵和右乘矩阵)加载到上述存储器(片上缓存)中,再进行之后的矩阵乘法运算的过程。
在一种可能的实现方式中,控制器也可以直接从片外内存上加载输入矩阵以及可执行程序到处理元件的寄存器中,本公开对此不作限定。
PE中还可以包括运算器以完成指定的运算,以矩阵运算为例,PE中可以包括例如乘法器、加法器等,各个PE中的具体结构可以相同,也可以存在不同,本公开对此不作限定。PE中还可以包括其他类型的运算器,以适应各种不同的运算过程,本公开对PE包括的运算器的数量和类型不作限定。
乘法操作的输入矩阵可以包括左乘矩阵和右乘矩阵,其中,左乘矩阵可以是指位于乘号左边的矩阵,右乘矩阵可以是指位于乘号右边的矩阵。
本公开提供的运算方法用于实现对第一矩阵和第二矩阵的矩阵乘法运算。其中,在一个示例中,第一矩阵可以为左乘矩阵,第二矩阵可以为右乘矩阵;在另一个示例中,第一矩阵可以为右乘矩阵,第二矩阵可以为左乘矩阵。
本公开的实施方式中,控制器可以将输入矩阵中的一个矩阵确定为待加载矩阵。由于处理器中PE的数量以及排列方式是固定的,因此,在一些情况下控制器可以对待加载矩阵进行分块,在一些情况下,可以不对加载到处理器中的矩阵进行分块。对于输入矩阵中除了待加载矩阵以外的另一矩阵,可以不进行分块处理。
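In a possible implementation, the controller may determine the matrix to be loaded from the input matrices, and determine whether to block the matrix to be loaded according to the arrangement of the processing elements and the numbers of rows and columns of the matrix to be loaded. Here, the arrangement of the processing elements may refer to the number of rows and the number of columns of processing elements, and the row rank and column rank of the matrix to be loaded may refer to the number of rows and the number of columns of that matrix. The matrix to be loaded may be the left multiplier matrix or the right multiplier matrix, which is not limited in the present disclosure.
If the number of rows of the matrix to be loaded is not greater than the number of rows of processing elements and its number of columns is not greater than the number of columns of processing elements, the controller need not block the matrix to be loaded. If its number of rows is greater than the number of rows of processing elements, or its number of columns is greater than the number of columns of processing elements, the controller may block the matrix to be loaded.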
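The blocking decision above reduces to a two-sided size comparison. A minimal sketch follows; the function name `needs_blocking` and its parameter names are hypothetical.

```python
def needs_blocking(rows, cols, pe_rows, pe_cols):
    """Return True if a rows x cols matrix does not fit a pe_rows x pe_cols PE array."""
    return rows > pe_rows or cols > pe_cols


fits_3x3_on_4x4 = needs_blocking(3, 3, 4, 4)   # 3x3 matrix on a 4x4 array
fits_2x4_on_2x2 = needs_blocking(2, 4, 2, 2)   # 2x4 matrix on a 2x2 array
```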
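In a possible implementation, when determining the matrix to be loaded from the input matrices, the controller may choose at random, or may, according to the arrangement of the processing elements, preferentially choose a matrix that does not need to be blocked as the matrix to be loaded. The present disclosure does not limit the specific way of determination.
For example, suppose the array of processing elements is denoted PE_MN, meaning the processing elements form an M×N matrix, where M is the number of rows of processing elements, N is the number of columns, and M and N are both positive integers. Suppose the left multiplier matrix is a_mn, an m×n matrix where m is the number of rows and n the number of columns of a_mn, with m and n positive integers, and the right multiplier matrix is b_nk, an n×k matrix where n is the number of rows and k the number of columns of b_nk, with k a positive integer. If m is less than M and n is less than N, while n is greater than M or k is greater than N, the controller may prefer matrix a_mn as the matrix to be loaded.
In a possible implementation, if both input matrices satisfy the condition for not needing to be blocked, that is, either can serve as the matrix to be loaded, the controller may choose one of them at random, or may choose the matrix with more elements as the matrix to be loaded, which reduces the number of element loads and improves operation efficiency.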
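One way to express the selection preference described above is sketched below. The tie-break toward the left multiplier matrix and the fallback when neither matrix fits are assumptions added for illustration (the text also allows a random choice); `fits` and `choose_matrix_to_load` are hypothetical names.

```python
def fits(shape, pe_shape):
    """True if a (rows, cols) matrix fits on a (pe_rows, pe_cols) array without blocking."""
    return shape[0] <= pe_shape[0] and shape[1] <= pe_shape[1]


def choose_matrix_to_load(left_shape, right_shape, pe_shape):
    """Return "left" or "right" for the matrix to keep resident in the PE registers."""
    left_ok, right_ok = fits(left_shape, pe_shape), fits(right_shape, pe_shape)
    if left_ok and right_ok:
        # Both fit: prefer the matrix with more elements (ties go left, by assumption).
        left_size = left_shape[0] * left_shape[1]
        right_size = right_shape[0] * right_shape[1]
        return "left" if left_size >= right_size else "right"
    if left_ok:
        return "left"
    if right_ok:
        return "right"
    return "left"  # neither fits; assumed default, the chosen matrix must be blocked


choice = choose_matrix_to_load((3, 3), (3, 5), (4, 4))
```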
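If the matrix to be loaded is to be blocked, the controller may block it according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, obtaining two or more first matrices.
It should be noted that the examples in the present disclosure load the first matrix into the processing elements; that is, either the matrix to be loaded itself, or a matrix obtained by blocking the matrix to be loaded, serves as the first matrix.
When no blocking is needed, if the loaded first matrix is the left multiplier matrix, the controller may take the right multiplier matrix as the second matrix; if the loaded first matrix is the right multiplier matrix, the controller may take the left multiplier matrix as the second matrix.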
对于需要分块的情况,如果对待加载矩阵进行分块得到两个以上第一矩阵,那么控制器可以根据情况对输入矩阵中的另一个矩阵进行处理。
如果处理元件包括的寄存器无法存储全部的第一矩阵,这时,根据对待加载矩阵分块的方式的不同,控制器可以对输入矩阵中待加载矩阵以外的另一个矩阵进行分块,也可以不进行分块。
比如说,若待加载矩阵为左乘矩阵,对待加载矩阵在行方向进行了分块,此时控制器可以不对另一个矩阵进行分块;如果待加载矩阵为左乘矩阵,对待加载矩阵在列方向进行了分块,此时控制器可以根据对待加载矩阵分块的方式,将输入矩阵中待加载矩阵以外的另一个矩阵进行分块得到两个以上第二矩阵。
若待加载矩阵为右乘矩阵,对待加载矩阵在行方向进行了分块,此时控制器可以根据对待加载矩阵分块的方式,将输入矩阵中待加载矩阵以外的另一个矩阵进行分块得到两个以上第二矩阵;如果待加载矩阵为右乘矩阵,对待加载矩阵在列方向进行了分块,此时控制器可以不对另一个矩阵进行分块。
如果待加载矩阵为a mn,那么根据矩阵a mn的行数和列数以及处理元件的行数和列数确定是否需要对矩阵a mn进行分块,如果矩阵a mn的行数m不大于处理元件的行数M、且列数n不大于处理元件的列数N,则可以不对矩阵a mn进行分块。如果矩阵a mn的行数m大于处理元件的行数M、或者列数n大于处理元件的列数N,则可以对矩阵a mn在行方向或者列方向进行分块。
如果待加载矩阵为b nk,那么根据矩阵b nk的行数和列数以及处理元件的行数和列数确定是否需要对矩阵b nk进行分块,如果矩阵b nk的行数n不大于处理元件的行数M、且列数k不大于处理元件的列数N,则可以不对矩阵b nk进行分块。如果矩阵b nk的行数n大于处理元件的行数M、或列数k大于处理元件的列数N,则可以对矩阵b nk在行方向或者列方向进行分块。
在一种可能的实现方式中,分块后得到的矩阵满足不需要再进行分块的条件,也就是说,分块后矩阵的行数不大于处理元件的行数、且列数不大于处理元件的列数。
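If the number of rows m of matrix a_mn is greater than the number of rows M of processing elements and its number of columns n is not greater than the number of columns N, the controller may block a_mn in the row direction. Because a_mn is the left multiplier matrix, blocking it in the row direction does not affect the normal operation with the right multiplier matrix, so the controller need not block the right multiplier matrix. If m is not greater than M and n is greater than N, a_mn may be blocked in the column direction; in this case the controller may block the right multiplier matrix in the row direction in the same way as a_mn was blocked in the column direction. Blocking "in the same way" means that the number of columns of each resulting first matrix equals the number of rows of the corresponding second matrix, which ensures that the matrix operation can be completed normally. If m is greater than M and n is greater than N, the controller may block a_mn in both the row direction and the column direction, and again block the right multiplier matrix in the row direction in the same way as a_mn was blocked in the column direction.
If the number of rows n of matrix b_nk is not greater than M and its number of columns k is greater than N, the controller may block b_nk in the column direction. Because b_nk is the right multiplier matrix, blocking it in the column direction does not affect the normal operation with the left multiplier matrix, so the controller need not block the left multiplier matrix. If n is greater than M and k is not greater than N, b_nk may be blocked in the row direction; in this case the controller may block the left multiplier matrix in the column direction in the same way as b_nk was blocked in the row direction, so that the number of columns of each resulting first matrix equals the number of rows of the corresponding second matrix. If n is greater than M and k is greater than N, the controller may block b_nk in both the row direction and the column direction, and again block the left multiplier matrix in the column direction in the same way as b_nk was blocked in the row direction.
In a possible implementation, blocking may be done so that the row rank and column rank of the blocked matrices are as close as possible to the numbers of rows and columns of processing elements, which improves operation efficiency and shortens operation time. For example, if the processing elements form a 4×4 array, blocking can first aim for 4×4 blocks, which uses the processing elements most efficiently.
For example, suppose the processing elements form a 2×2 array, the left multiplier matrix is a 2×4 matrix, and the right multiplier matrix is a 4×3 matrix. In this case both matrices need to be blocked regardless of whether the left or the right multiplier matrix is loaded. There are many ways to block; Fig. 1-2a and Fig. 1-2b show two different ways, in each of which matrix a_24 in the column direction and matrix b_43 in the row direction are blocked in the same way. In Fig. 1-2a, a_24 is divided in the column direction into two parts of two columns each, and b_43 is divided in the row direction into two parts of two rows each. In Fig. 1-2b, a_24 is divided in the column direction into three parts, one with two columns and the other two with one column each, and b_43 is divided in the row direction into three parts, one with two rows and the other two with one row each. The above arrangement of processing elements and blocking of the input matrices are only an example and do not limit the present disclosure in any way.
The blocks produced in Fig. 1-2a have row and column ranks closer to the numbers of rows and columns of processing elements, which helps improve processing element utilization and reduces control complexity. For the same input matrices, fewer blocks also mean fewer data loads, so this way of blocking is more efficient.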
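The Fig. 1-2a style of partition (blocks as close as possible to the PE array size, with the shared dimension cut identically in both matrices) can be sketched as follows. The helper names `split_ranges` and `partition` and the numeric values are illustrative assumptions; the sketch assumes the left multiplier matrix is the one being loaded, so the right multiplier matrix is only cut along its rows.

```python
def split_ranges(length, tile):
    """Split range(length) into consecutive (start, stop) pieces of at most `tile` items."""
    return [(start, min(start + tile, length)) for start in range(0, length, tile)]


def partition(matrix, row_ranges, col_ranges):
    """Return blocks[i][j], the sub-matrix covering row_ranges[i] x col_ranges[j]."""
    return [[[row[c0:c1] for row in matrix[r0:r1]] for (c0, c1) in col_ranges]
            for (r0, r1) in row_ranges]


pe_rows, pe_cols = 2, 2
a = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                       # 2x4 left multiplier matrix
b = [[1, 0, 2],
     [0, 1, 3],
     [4, 5, 6],
     [7, 8, 9]]                          # 4x3 right multiplier matrix

# The columns of a and the rows of b share one set of cut points.
shared = split_ranges(len(a[0]), pe_cols)
a_blocks = partition(a, split_ranges(len(a), pe_rows), shared)
b_blocks = partition(b, shared, [(0, len(b[0]))])
```

Here `a_blocks[0][t]` has as many columns as `b_blocks[t][0]` has rows, which is the "same way" condition stated above.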
对于左乘矩阵的行方向和右乘矩阵的列方向的分块方式,本公开不作具体的限定,只要分块后的矩阵都满足不需要再进行分块的条件即可。
在一种可能的实现方式中,如果处理元件包含的寄存器的数量可以满足存储输入矩阵的需求,那么还可以采用堆叠存储的方式将划分后的第一矩阵存储到处理元件的寄存器中,来实现输入矩阵的乘法运算。比如说,每个处理元件可以包括多个寄存器,控制器可以把处理元件中的寄存器分为多个不同的组,控制器在对所述输入矩阵进行分块后,可以在多组寄存器中堆叠存储所述两个以上第一矩阵,每组存储一个第一矩阵。在该实施方式中,控制器可以将输入矩阵中待加载矩阵以外的另一个矩阵作为第二矩阵。需要说明的是,堆叠存储仅仅是一种可选的实现方式,本公开不限于此。
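Fig. 1-3 shows a flowchart of an operation method according to an embodiment of the present disclosure. The method is first described for the case in which the matrix to be loaded does not need to be blocked. Suppose the matrix to be loaded is the first matrix and the other input matrix is the second matrix. As shown in Fig. 1-3, the operation method provided by the present disclosure may include the following steps:
Step S1-11: load the first matrix into the registers of the processing elements;
In a possible implementation, the elements of the first matrix are arranged in the registers of the processing elements in the same way as they are arranged in the matrix;
Step S1-12: for each row or each column of the second matrix, store the elements of that row or column in the registers of the processing elements in correspondence with the elements of each column or each row of the first matrix, multiply them by the elements in each column or each row of the first matrix, and sum the products of one column or one row to obtain a first intermediate result. In other words, for each row or each column of the second matrix, the elements of that row or column are stored in the registers of the processing elements whose registers hold the elements of each column or each row of the first matrix.
That is, for each row of the second matrix, the elements of that row are stored in the registers of the processing elements in correspondence with the elements of each column of the first matrix and multiplied by the elements in each column of the first matrix, and the products of one column are summed to obtain a first intermediate result; or, for each column of the second matrix, the elements of that column are stored in the registers of the processing elements in correspondence with the elements of each row of the first matrix and multiplied by the elements in each row of the first matrix, and the products of one row are summed to obtain a first intermediate result.
Step S1-13: process the first intermediate results to obtain the product of the first matrix and the second matrix.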
对于不分块的情况,控制器可以直接把左乘矩阵作为第一矩阵、右乘矩阵作为第二矩阵,或者将左乘矩阵作为第二矩阵、右乘矩阵作为第一矩阵,本公开对此不作限定。
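In one example, the first matrix is the left multiplier matrix and the second matrix is the right multiplier matrix. Then in step S1-12, for each column of the second matrix, each element of that column may be stored in the registers of the processing elements together with the corresponding column of the first matrix (in other words, each element of that column is stored in the registers of the processing elements whose registers hold the corresponding column of the first matrix). Each processing element is controlled to multiply the elements in its registers to obtain an element product, and the element products of each row are summed to obtain the first intermediate result. Here, the column of the first matrix that corresponds to an element is the column whose column index in the first matrix equals the row index of that element in the second matrix.
In another example, the first matrix is the right multiplier matrix and the second matrix is the left multiplier matrix. Then in step S1-12, for each row of the second matrix, each element of that row may be stored in the registers of the processing elements together with the corresponding row of the first matrix. Each processing element is controlled to multiply the elements in its registers to obtain an element product, and the element products of each column are summed to obtain the first intermediate result. Here, the row of the first matrix that corresponds to an element is the row whose row index in the first matrix equals the column index of that element in the second matrix.
How the first intermediate results are processed in step S1-13 depends on whether the matrix loaded into the processor is the left multiplier matrix or the right multiplier matrix. Specifically, if the first matrix is the left multiplier matrix, each first intermediate result serves as one column of the product matrix of the first matrix and the second matrix, and its column index in the product matrix equals the column index of the second-matrix column used to compute it. If the first matrix is the right multiplier matrix, each first intermediate result serves as one row of the product matrix, and its row index in the product matrix equals the row index of the second-matrix row used to compute it.
In a possible implementation, for the processing elements of the same row or the same column, the controller may control them to move each computed element product to one processing element of that row or column, and control that processing element to sum the element products to obtain the first intermediate result. For example, when the first matrix is the left multiplier matrix and the second matrix is the right multiplier matrix, each time element products are computed the controller may control the processing elements of the same row to move them to one processing element of that row, and control that processing element to sum them to obtain the first intermediate result. When the first matrix is the right multiplier matrix and the second matrix is the left multiplier matrix, the controller may do the same for the processing elements of the same column. The processing element may use an adder to sum the element products. The one processing element may be a processing element that stores an element of the first matrix or one that does not; the present disclosure does not limit this.
The above is only one way of computing the first intermediate result and the present disclosure is not limited to it; for example, dedicated adders may also be provided on the rows or columns of the processing element array to carry out the above computation.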
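Steps S1-11 to S1-13 for the case where the first (resident) matrix is the left multiplier matrix can be modelled in a few lines. This is a software sketch of the data flow only, assuming the first matrix fits on the PE array without blocking; the function name `matmul_left_resident` and the list-of-lists representation are hypothetical.

```python
def matmul_left_resident(first, second):
    """Compute first x second with `first` held in the PE registers."""
    m, n, k = len(first), len(second), len(second[0])
    assert all(len(row) == n for row in first), "inner dimensions must match"
    product = [[0] * k for _ in range(m)]
    for j in range(k):                       # one column of the second matrix per pass
        # S1-12: element second[r][j] is stored alongside column r of the first
        # matrix, so PE(i, r) multiplies first[i][r] by second[r][j].
        pe_products = [[first[i][r] * second[r][j] for r in range(n)]
                       for i in range(m)]
        # Summing the element products along each PE row gives the first
        # intermediate result for this pass.
        first_intermediate = [sum(row) for row in pe_products]
        # S1-13: the first intermediate result is column j of the product.
        for i in range(m):
            product[i][j] = first_intermediate[i]
    return product


result = matmul_left_resident([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The resident matrix is read from memory once; only one column of the second matrix is streamed in per pass.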
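Example 1-1: the first matrix is the left multiplier matrix and the second matrix is the right multiplier matrix
Suppose the first matrix a_mn and the second matrix b_nk are both 3×3 matrices, and the processing elements form a 4×4 array.
Fig. 1-4 shows a schematic diagram of an array of processing elements according to an embodiment of the present disclosure. The operation method of the present disclosure is described below with reference to Fig. 1-4 and Fig. 1-3.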
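Suppose the first matrix (Figure PCTCN2021075957-appb-000001) is
$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}$$
and the second matrix (Figure PCTCN2021075957-appb-000002) is
$$\begin{pmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{pmatrix}$$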
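The first matrix is loaded into the registers of the processing elements. It may be loaded according to the arrangement of its rows and columns, that is, the elements of the first matrix are arranged in the registers of the processing elements in the same way as in the matrix. In other words, "the same arrangement" means that the difference between the row index of an element and the row index of the processing element holding it is the same for all elements, and likewise the difference between the column index of an element and the column index of its processing element is the same for all elements.
In a possible implementation, the row and column indices of an element in the first matrix are the same as the row and column indices, within the processing element array, of the processing element into which it is loaded.
For instance, in one example the controller may load A_11 into the register of PE_11, A_12 into the register of PE_12, A_13 into the register of PE_13, A_21 into the register of PE_21, ..., and A_33 into the register of PE_33. That is, the index of each element of the first matrix may be exactly the same as that of its processing element, and the row index difference and column index difference above are both 0.
In another example, the controller may load A_11 into the register of PE_12, A_12 into the register of PE_13, A_13 into the register of PE_14, A_21 into the register of PE_22, ..., and A_33 into the register of PE_34. The elements are still arranged in the registers in the same way as in the matrix; the row index difference is 0 and the column index difference is 1.
It should be noted that the two examples above are only some ways of loading the first matrix and do not limit the present disclosure in any way. Those skilled in the art will understand that it is sufficient for the elements of the first matrix to be arranged in the registers of the processing elements in the same way as in the matrix.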
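In a possible implementation, after the input matrix has been loaded, for step S1-12 the controller may store element B_11 of the first column of the second matrix in the registers of the processing elements together with the corresponding column of the first matrix. The corresponding column is the column whose column index in the first matrix equals the row index of the element in the second matrix. B_11 is in the first row of the second matrix, so its corresponding column is the first column of the first matrix. That is, the controller stores element B_11 in the registers of the processing elements whose registers hold A_11, A_21, and A_31.
The controller stores element B_21 of the first column of the second matrix in the registers of the processing elements whose registers hold A_12, A_22, and A_32, and stores element B_31 of the first column of the second matrix in the registers of the processing elements whose registers hold A_13, A_23, and A_33.
In other words, B_11 shares a processing element's registers with each of A_11, A_21, and A_31; B_21 shares a processing element's registers with each of A_12, A_22, and A_32; and B_31 shares a processing element's registers with each of A_13, A_23, and A_33.
The controller in the processor controls the processing elements to multiply the elements stored in their respective registers, and then sums the products of each row to obtain the first intermediate results: B_11×A_11 + B_21×A_12 + B_31×A_13, B_11×A_21 + B_21×A_22 + B_31×A_23, and B_11×A_31 + B_21×A_32 + B_31×A_33. If the matrix obtained by multiplying the first matrix by the second matrix is C_33, these first intermediate results are C_11, C_21, and C_31.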
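In a possible implementation, by way of example, the controller may load A_11 into the register of PE_11, A_12 into the register of PE_12, A_13 into the register of PE_13, A_21 into the register of PE_21, ..., and A_33 into the register of PE_33, so that the index of each element of the first matrix is exactly the same as that of its processing element and the row and column index differences are both 0. In this example, after the controller has stored the first-column elements B_11, B_21, and B_31 of the second matrix in the registers of the processing elements, it controls each processing element to use its multiplier to multiply the elements in its registers and obtain an element product. The controller may then control each row of processing elements to move the computed element products to one processing element of that row. For instance, the controller may control PE_11, PE_12, and PE_13 to move their element products B_11×A_11, B_21×A_12, and B_31×A_13 to processing element PE_14, and control PE_14 to use its adder to sum them and obtain C_11. Note that the controller may also control the first-row processing elements to move the element products to PE_11, PE_12, or PE_13; the present disclosure does not limit this. After the controller has the second-row and third-row processing elements perform similar operations, the first intermediate results C_11, C_21, and C_31 are obtained.
Repeating the above process for each column of the second matrix yields the first intermediate results C_12, C_22, C_32 and C_13, C_23, C_33. From these first intermediate results the product of the first matrix and the second matrix (Figure PCTCN2021075957-appb-000003) is obtained:
$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{pmatrix}$$
In a possible implementation, the first intermediate results obtained may be stored by column to give the product of the first matrix and the second matrix. This is what was described above: when the first matrix is the left multiplier matrix, each first intermediate result serves as one column of the product matrix of the first matrix and the second matrix. The column index of the first intermediate result in the product matrix equals the column index of the second-matrix column used to compute it; in the above example, the first intermediate results C_11, C_21, and C_31, obtained from the first column of the second matrix and the elements of the first matrix, form the first column of C_33.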
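Example 1-2: the first matrix is the right multiplier matrix and the second matrix is the left multiplier matrix
Again suppose the first matrix a_mn and the second matrix b_nk are both 3×3 matrices, and the processing elements form a 4×4 array.
Suppose the first matrix (Figure PCTCN2021075957-appb-000004) is
$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}$$
and the second matrix (Figure PCTCN2021075957-appb-000005) is
$$\begin{pmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{pmatrix}$$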
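The first matrix is loaded into the registers of the processing elements; the loading can follow the way the first matrix is loaded in Example 1-1 and is not repeated here.
After the first matrix has been loaded, for step S1-12, element B_11 of the first row of the second matrix is stored in the registers of the processing elements together with the corresponding row of the first matrix. The corresponding row is the row whose row index in the first matrix equals the column index of the element in the second matrix. B_11 is in the first column of the second matrix, so its corresponding row is the first row of the first matrix. That is, the controller may store element B_11 in the registers of the processing elements whose registers hold A_11, A_12, and A_13.
Element B_12 of the first row of the second matrix is stored in the registers of the processing elements whose registers hold A_21, A_22, and A_23, and element B_13 of the first row of the second matrix is stored in the registers of the processing elements whose registers hold A_31, A_32, and A_33.
In other words, B_11 shares a processing element's registers with each of A_11, A_12, and A_13; B_12 shares a processing element's registers with each of A_21, A_22, and A_23; and B_13 shares a processing element's registers with each of A_31, A_32, and A_33.
The controller in the processor controls the processing elements to multiply the elements stored in their respective registers, and then sums the products of each column to obtain the first intermediate results: B_11×A_11 + B_12×A_21 + B_13×A_31, B_11×A_12 + B_12×A_22 + B_13×A_32, and B_11×A_13 + B_12×A_23 + B_13×A_33. If the matrix obtained by multiplying the first matrix and the second matrix is C_33, these first intermediate results are C_11, C_12, and C_13.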
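In a possible implementation, by way of example, the controller may load A_11 into the register of PE_11, A_12 into the register of PE_12, A_13 into the register of PE_13, A_21 into the register of PE_21, ..., and A_33 into the register of PE_33, so that the index of each element of the first matrix is exactly the same as that of its processing element and the row and column index differences are both 0. In this example, after the controller has stored the first-row elements B_11, B_12, and B_13 of the second matrix in the registers of the processing elements, it controls each processing element to use its multiplier to multiply the elements in its registers and obtain an element product. The controller may then control each column of processing elements to move the computed element products to one processing element of that column. For instance, the controller may control PE_11, PE_21, and PE_31 to move their element products B_11×A_11, B_12×A_21, and B_13×A_31 to processing element PE_41, and control PE_41 to use its adder to sum them and obtain C_11. Note that the controller may also control the first-column processing elements to move the element products to PE_11, PE_21, or PE_31; the present disclosure does not limit this. After the controller has the second-column and third-column processing elements perform similar operations, the first intermediate results C_11, C_12, and C_13 are obtained.
Repeating the above process for each row of the second matrix yields the first intermediate results C_21, C_22, C_23 and C_31, C_32, C_33. From these first intermediate results the product of the first matrix and the second matrix (Figure PCTCN2021075957-appb-000006) is obtained:
$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{pmatrix}$$
In a possible implementation, the first intermediate results obtained may be stored by row to give the product of the first matrix and the second matrix, since each first intermediate result is one row of the product matrix when the first matrix is the right multiplier matrix.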
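For the case of Example 1-2, where the resident first matrix is the right multiplier matrix, the same data flow applies with rows and columns swapped: one row of the second matrix is streamed per pass, element products are summed down each PE column, and each first intermediate result is one row of the product. The sketch below is a software model under the assumption that no blocking is needed; `matmul_right_resident` is a hypothetical name.

```python
def matmul_right_resident(first, second):
    """Compute second x first with `first` (the right multiplier) held in the PE registers."""
    n, k = len(first), len(first[0])
    product = []
    for row in second:                       # one row of the second matrix per pass
        assert len(row) == n, "inner dimensions must match"
        # S1-12: element row[c] is stored alongside row c of the first matrix,
        # so PE(c, j) multiplies first[c][j] by row[c].
        pe_products = [[row[c] * first[c][j] for j in range(k)] for c in range(n)]
        # Summing down each PE column gives one row of the product.
        product.append([sum(pe_products[c][j] for c in range(n)) for j in range(k)])
    return product


result = matmul_right_resident([[5, 6], [7, 8]], [[1, 2], [3, 4]])
```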
需要说明的是,以上示例中的处理元件的排列、输入矩阵等仅仅是为了清楚说明本公开运算方法的过程,不以任何方式限制本公开。
根据本公开上述各实施方式的矩阵乘的运算方法,对于满足处理元件的排列的任意规模的输入矩阵,可以得到矩阵乘法的运算结果。
对于不进行分块的情况,根据上述示例可以直接得到矩阵乘的结果。
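The matrix multiplication operation method according to the above embodiments of the present disclosure is better suited to processors composed of processing elements arranged in an array and, compared with matrix multiplication in the related art, can reduce the number of memory accesses, lower bandwidth pressure, and improve operation efficiency. For the case where blocking is needed, for the blocked first matrices and the second matrices (which may be obtained by blocking, or the other matrix may be taken directly as the second matrix), the product of the left multiplier matrix and the right multiplier matrix is computed according to the rules of matrix multiplication from the products of each first matrix and its corresponding second matrix. That is, each first matrix and second matrix obtained by blocking can be treated as a single element of a matrix, the matrix multiplication procedure is carried out according to the rules of matrix multiplication to obtain second intermediate results, and the product of the input matrices is computed from the second intermediate results.
Fig. 1-5 shows a schematic diagram of blocking according to an embodiment of the present disclosure. As shown in Fig. 1-5, matrices D and E are blocked in the manner described above to obtain the first matrices D_11, D_12, D_21, D_22 and the second matrices E_11, E_12, E_21, E_22. Each first matrix and second matrix can be treated as a single element when carrying out the matrix multiplication. For example, the first row of D times the first column of E is F_11 = D_11×E_11 + D_12×E_21; the first row of D times the second column of E is F_12 = D_11×E_12 + D_12×E_22; the second row of D times the first column of E is F_21 = D_21×E_11 + D_22×E_21; and the second row of D times the second column of E is F_22 = D_21×E_12 + D_22×E_22. That is, to obtain the final result of the matrix multiplication, the following second intermediate results must be obtained first:
D_11×E_11, D_12×E_21, D_11×E_12, D_12×E_22
D_21×E_11, D_22×E_21, D_21×E_12, D_22×E_22
Each second intermediate result can be computed by applying steps S1-11 to S1-13 to the corresponding first matrix and second matrix.
By blocking the input matrices and applying the matrix multiplication of the present disclosure to each pair of blocks to obtain the second intermediate results, the product of the input matrices can be computed from the second intermediate results using the rules of matrix multiplication. With the operation method of the above embodiments, matrix multiplication can be carried out quickly for matrices of any dimension, with high operation efficiency.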
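The block-level rule F_ij = Σ_t D_it × E_tj can be sketched as below. In this illustration a plain `matmul` helper stands in for the PE-array procedure of steps S1-11 to S1-13 that would produce each second intermediate result; all function names and the sample values are hypothetical.

```python
def matmul(x, y):
    """Plain matrix product, standing in for the PE-array computation of one block pair."""
    return [[sum(x[i][t] * y[t][j] for t in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def matadd(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]


def blocked_matmul(d_blocks, e_blocks):
    """d_blocks[i][t] and e_blocks[t][j] are sub-matrices; returns f_blocks[i][j]."""
    inner = len(e_blocks)
    f_blocks = []
    for d_row in d_blocks:
        f_row = []
        for j in range(len(e_blocks[0])):
            # Second intermediate results d_row[t] x e_blocks[t][j], summed over t.
            acc = matmul(d_row[0], e_blocks[0][j])
            for t in range(1, inner):
                acc = matadd(acc, matmul(d_row[t], e_blocks[t][j]))
            f_row.append(acc)
        f_blocks.append(f_row)
    return f_blocks


# 2x2 grid of 1x1 blocks, so the result is easy to check by hand.
d = [[[[1]], [[2]]], [[[3]], [[4]]]]
e = [[[[5]], [[6]]], [[[7]], [[8]]]]
f = blocked_matmul(d, e)
```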
对于进行分块的情况,如果处理元件包含的寄存器的数量可以满足存储输入矩阵的需求,那么还可以采用堆叠存储的方式将输入矩阵存储到处理元件的寄存器中,来实现输入矩阵的乘法运算。比如说,每个处理元件中可以包括多个寄存器,控制器可以将处理元件中的寄存器分为多组寄存器,那么,所述处理器包括多组寄存器,每组寄存器用于存储分块后的一个第一矩阵。因此,在一种可能的实现方式中,控制器可以根据对输入矩阵分块的方式对处理元件的寄存器进行分组得到多组寄存器。
在本实施方式中,本公开的运算方法还可以包括:
在对所述输入矩阵进行分块后,控制器在所述多组寄存器中堆叠存储所述两个以上第一矩阵,每组寄存器存储一个第一矩阵。
在另一种可能的实现方式中,控制器也可以每次存储一个第一矩阵,参照图1-5的示例,根据第二中间结果计算输入矩阵的乘积。
按照步骤S1-11-步骤S1-13的过程执行第一矩阵和与第一矩阵对应的第二矩阵的矩阵乘法运算得到第二中间结果,根据第二中间结果计算输入矩阵的乘积。其中,与第一矩阵对应的第二矩阵可以是指根据矩阵乘法规则左乘矩阵/右乘矩阵分块得到的矩阵中需要与第一矩阵进行乘法运算的矩阵。
示例1-3 堆叠存储结合步骤S1-11-步骤S1-13
举例来说,以处理元件为2×2的阵列,输入矩阵都为4×4矩阵为例对本公开的运算方法进行说明。
假设左乘矩阵
Figure PCTCN2021075957-appb-000007
右乘矩阵为
Figure PCTCN2021075957-appb-000008
那么,在一示例中,可以将左乘矩阵和右乘矩阵都划分为2×2的矩阵。需要说明的是,以上分块方式仅仅是本公开的一个示例,还可以采用其他方式进行分块,本公开对此不作限定。
图1-6示出根据本公开一实施例的对矩阵划分的示例。如图1-6所示,可以将左乘矩阵和右乘矩阵都划分为2×2的子矩阵,左乘矩阵划分后得到四个第一矩阵a 11、a 12、a 21、a 22,其中,a 11
Figure PCTCN2021075957-appb-000009
a 12
Figure PCTCN2021075957-appb-000010
a 21
Figure PCTCN2021075957-appb-000011
a 22
Figure PCTCN2021075957-appb-000012
右乘矩阵划分后得到四个第二矩阵b 11、b 12、b 21、b 22,其中,b 11
Figure PCTCN2021075957-appb-000013
b 12
Figure PCTCN2021075957-appb-000014
b 21
Figure PCTCN2021075957-appb-000015
b 22
Figure PCTCN2021075957-appb-000016
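Taking the computation of the second intermediate results by steps S1-11 to S1-13 as an example, suppose the processing elements form a 2×2 array. For the example of Fig. 1-6, the operation method of the present disclosure may load the first matrices as shown in Table 1-1. Reg0, Reg1, Reg2, and Reg3 each denote one group of registers in the processing elements. The processing elements form a 2×2 array and each processing element includes multiple registers; when storing data, the registers of one group store one first matrix, as shown in Table 1-1.
In a possible implementation, the first matrices and the corresponding second matrices are processed in the manner of step S1-12: Reg0 stores a_11, and the first column of b_11 is stored in the registers of the processing elements holding the first and second rows of a_11; Reg1 stores a_12, and the first column of b_21 is stored in the registers of the processing elements holding the first and second rows of a_12; Reg2 stores a_21, and the first column of b_12 is stored in the registers of the processing elements holding the first and second rows of a_21; Reg3 stores a_22, and the first column of b_22 is stored in the registers of the processing elements holding the first and second rows of a_22, as shown in Table 1-2.
The controller in the processor then controls the processing elements to multiply the elements stored in their respective registers to obtain element products, and sums the element products of each row to obtain the first intermediate results (the detailed process is as in the examples above and is not repeated). The second columns of b_11, b_12, b_21, and b_22 are stored and multiplied in a similar way, and the element products are summed by row to obtain further first intermediate results. Processing the first intermediate results gives the second intermediate results a_11×b_11, a_12×b_21, a_21×b_12, and a_22×b_22.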
表1-1 元素存储示例
Figure PCTCN2021075957-appb-000017
表1-2 元素存储示例
Figure PCTCN2021075957-appb-000018
也就是说,在计算过程中,对于每一组寄存器内的元素,控制器可以控制处理元件计算得到第二中间结果a 11×b 11、a 12×b 21、a 21×b 12以及a 22×b 22。具体过程不再赘述。根据第二中间结果a 11×b 11、a 12×b 21、a 21×b 12以及a 22×b 22,控制器可以控制处理元件计算得到C 11=a 11×b 11+a 12×b 21,C 22=a 21×b 12+a 22×b 22
根据以上过程,控制器还可以控制处理元件根据步骤S1-11-步骤S1-13的过程计算得到第二中间结果a 11×b 12、a 12×b 22、a 21×b 11以及a 22×b 21:将b 11的第一列存储到a 21的第一行和第二行所在的处理元件的寄存器中,将b 21的第一列存储到a 22的第一行和第二行所在的处理元件的寄存器中,将b 12的第一列存储到a 11的第一行和第二行所在的处理元件的寄存器中,将b 22的第一列存储到a 12的第一行和第二行所在的处理元件的寄存器中,然后处理器中的控制器控制处理元件分别对对应的寄存器内存储的元素求乘积得到元素乘积,然后计算每一行元素乘积的和得到第一中间结果;对b 11、b 12、b 21、b 22的第二列,采用类似的方式进行存储并计算乘积,按行求和得到第一中间结果,将第一中间结果进行处理可以得到第二中间结果a 11×b 12、a 12×b 22、a 21×b 11以及a 22×b 21。根据第二中间结果a 11×b 12、a 12×b 22、a 21×b 11以及a 22×b 21可以计算得到C 12=a 11×b 12+a 12×b 22,C 21=a 21×b 11+a 22×b 21
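The two passes described above (the Table 1-2 pairing, then the swapped pairing) can be mimicked with four named register groups. This is a software sketch only: the dictionary `register_groups`, the helper names, and the numeric block values are illustrative assumptions, and `matmul` stands in for the per-block procedure of steps S1-11 to S1-13.

```python
def matmul(x, y):
    """Plain matrix product, standing in for the PE-array computation of one block pair."""
    return [[sum(x[i][t] * y[t][j] for t in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def matadd(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]


# 2x2 blocks of two 4x4 input matrices (illustrative values).
a11, a12 = [[1, 2], [5, 6]], [[3, 4], [7, 8]]
a21, a22 = [[9, 10], [13, 14]], [[11, 12], [15, 16]]
b11, b12 = [[1, 0], [0, 1]], [[2, 0], [0, 2]]
b21, b22 = [[0, 1], [1, 0]], [[1, 1], [1, 1]]

# Stacked storage: each register group keeps one first matrix resident.
register_groups = {"Reg0": a11, "Reg1": a12, "Reg2": a21, "Reg3": a22}

# Which second matrix is streamed against each group in each pass.
pass_1 = {"Reg0": b11, "Reg1": b21, "Reg2": b12, "Reg3": b22}
pass_2 = {"Reg0": b12, "Reg1": b22, "Reg2": b11, "Reg3": b21}


def run_pass(pairing):
    """Second intermediate results: one block product per register group."""
    return {group: matmul(register_groups[group], block) for group, block in pairing.items()}


p1, p2 = run_pass(pass_1), run_pass(pass_2)
c11 = matadd(p1["Reg0"], p1["Reg1"])   # a11 x b11 + a12 x b21
c22 = matadd(p1["Reg2"], p1["Reg3"])   # a21 x b12 + a22 x b22
c12 = matadd(p2["Reg0"], p2["Reg1"])   # a11 x b12 + a12 x b22
c21 = matadd(p2["Reg2"], p2["Reg3"])   # a21 x b11 + a22 x b21
```

The first matrices are loaded once and reused in both passes; only the second-matrix blocks change between passes.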
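In another possible implementation, as shown in Table 1-3, in step S1-12 the controller may instead first store the first column of b_11 in the registers of the processing elements holding the first and second rows of a_11 and in the registers of the processing elements holding the first and second rows of a_21, and store the first column of b_21 in the registers of the processing elements holding the first and second rows of a_12 and in the registers of the processing elements holding the first and second rows of a_22.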
表1-3 元素存储示例
Figure PCTCN2021075957-appb-000019
对于表1-3的示例,处理器中的控制器控制处理元件分别对对应的寄存器内存储的元素求乘积得到元素乘积,然后计算每一行元素乘积的和得到第一中间结果。对于b 11、b 21的第二列,采用类似的方式进行存储并计算乘积得到元素乘积,按行求和得到第一中间结果。控制器可以控制处理元件根据第一中间结果计算得到第二中间结果a 11×b 11、a 12×b 21、a 21×b 11以及a 22×b 21
对于b 12、b 22也可以重复上述过程得到第二中间结果a 11×b 12、a 12×b 22、a 21×b 12以及a 22×b 22。具体过程不再赘述。
根据第二中间结果可以计算得到输入矩阵的乘积。
根据以上过程,可以采用分块的方式计算得到输入矩阵的乘积。因此,根据本公开的矩阵乘的运算方法可以实现任意大小规模的矩阵运算。并且,相比于相关技术中的矩阵乘运算可以减少访存次数,降低带宽压力,提高运算的效率。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作处理,但是本领域技术人员应该知悉,本公开并不受所描述的动作顺序的限制,因为依据本公开,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本公开所必须的。
进一步需要说明的是,虽然流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
本公开还提供了一种处理器。图1-1所示为处理器的一个示例,处理器可以包括两个以上处理元件,两个以上处理元件以二维矩阵排列,每个处理元件包括至少一个寄存器,所述处理器用于实现对第一矩阵和第二矩阵的矩阵乘法运算。
在一种可能的实现方式中,所述处理器还包括控制器,所述控制器用于将第一矩阵加载到处理元件的寄存器中;
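For each row of the second matrix, the controller is used to store the elements of that row in the registers of the processing elements that hold the elements of each column of the first matrix, multiply them by the elements in each column of the first matrix, and sum the products of one column to obtain a first intermediate result; or, for each column of the second matrix, the controller is used to store the elements of that column in the registers of the processing elements that hold the elements of each row of the first matrix, multiply them by the elements in each row of the first matrix, and sum the products of one row to obtain a first intermediate result;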
所述控制器还用于将第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
其中的第一矩阵可以是对待加载矩阵分块后得到的多个第一矩阵中的一个,待加载矩阵可以为左乘矩阵或者右乘矩阵。输入矩阵中除了待加载矩阵以外的另一个矩阵为第二矩阵。
第一矩阵也可以不是分块后的矩阵,例如,第一矩阵可以为输入矩阵中的左乘矩阵或者右乘矩阵,第二矩阵为输入矩阵中的另一个矩阵。
也就是说,在一种可能的实现方式中,本公开的处理器的控制器还可以根据处理元件的排列,从输入矩阵中确定不需要进行分块的矩阵为第一矩阵,输入矩阵中的另一矩阵为第二矩阵,输入矩阵包括左乘矩阵和右乘矩阵。
在一种可能的实现方式中,第一矩阵为左乘矩阵、第二矩阵为右乘矩阵,针对第二矩阵中的每一列元素,所述控制器用于将该列元素中的每个元素存储至第一矩阵中对应的一列元素存储的处理元件的寄存器,控制每一个处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,计算每一行元素乘积的和得到第一中间结果,其中,第一矩阵中与所述每个元素对应的一列元素是指,该元素在所述第二矩阵中的行数与一列元素的列数相同。
在另一种可能的实现方式中,第一矩阵为右乘矩阵、第二矩阵为左乘矩阵,针对第二矩阵中的每一行元素,所述控制器用于将该行元素中的每个元素存储至第一矩阵中对应的一行元素存储的处理元件的寄存器,控制每一个处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,计算每一列元素乘积的和得到第一中间结果,其中,第一矩阵中与所述每个元素对应的一行元素是指,该元素在所述第二矩阵中的列数与一行元素所在的行数相同。
针对上述两种实施方式,对于不分块的具体的示例可以参见上文运算方法部分的描述,不再赘述。
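In another possible implementation, the controller is further used to determine the matrix to be loaded from the input matrices, where the input matrices include the left multiplier matrix and the right multiplier matrix and the matrix to be loaded is the left multiplier matrix or the right multiplier matrix, and to determine whether to block the matrix to be loaded according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded. If the matrix to be loaded is to be blocked, the controller is used to block it according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, obtaining two or more first matrices.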
在该实施方式中,所述控制器还用于根据对待加载矩阵分块的方式,对输入矩阵中除了待加载矩阵以外的另一个矩阵进行分块得到两个以上第二矩阵;在该实施方式中,所述处理器包括多组寄存器,在对所述输入矩阵进行分块后,所述控制器还用于在所述多组寄存器中堆叠存储所述两个以上第一矩阵,每组存储一个第一矩阵。在该实施方式中,控制器还可以根据第一矩阵和对应的第二矩阵的乘积,按照矩阵乘的规则计算所述左乘矩阵和所述右乘矩阵的乘积。
针对上述分块的具体示例,可以参见上文中关于图1-5和图1-6部分的描述,不再赘述。
本公开实施例还提出一种人工智能芯片,所述芯片包括如上所述的处理器。
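In a possible implementation, a board card is also disclosed, which includes a storage device, an interface device, a control device, and the above artificial intelligence chip. The artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively. The storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and an external device; the control device is used to monitor the state of the artificial intelligence chip.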
依据以下条款可更好地理解前述内容:
条款A1.一种基于处理元件矩阵的矩阵乘的运算方法,应用于处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述方法实现对第一矩阵和第二矩阵的矩阵乘法运算,
所述方法包括:
将第一矩阵加载到处理元件的寄存器中;
针对第二矩阵的每一行,将所述每一行中的元素存储到第一矩阵的每一列元素存储的处理元件的寄存器,与第一矩阵的每一列中的元素分别求乘积,计算一列乘积的和得到第一中间结果;或者,针对第二矩阵的每一列,将所述每一列中的元素存储到第一矩阵的每一行元素存储的处理元件的寄存器,与第一矩阵的每一行中的元素分别求乘积,计算一行乘积的和得到第一中间结果;
将第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
条款A2.根据条款A1所述的方法,第一矩阵为左乘矩阵、第二矩阵为右乘矩阵,
针对第二矩阵中的每一列元素,将该列元素中的每个元素存储至第一矩阵中对应的一列元素存储的处理元件的寄存器,控制每一个处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,计算每一行元素乘积的和得到第一中间结果,
其中,第一矩阵中与所述每个元素对应的一列元素是指,该元素在所述第二矩阵中的行数与一列元素的列数相同。
条款A3.根据条款A1所述的方法,第一矩阵为右乘矩阵、第二矩阵为左乘矩阵,
针对第二矩阵中的每一行元素,将该行元素中的每个元素存储至第一矩阵中对应的一行元素存储的处理元件的寄存器,控制每一个处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,计算每一列元素乘积的和得到第一中间结果,
其中,第一矩阵中与所述每个元素对应的一行元素是指,该元素在所述第二矩阵中的列数与一行元素所在的行数相同。
条款A4.根据条款A1-A3任意一项所述的方法,所述方法还包括:
根据处理元件的排列,从输入矩阵中确定不需要进行分块的矩阵为第一矩阵,输入矩阵中的另一矩阵为第二矩阵。
条款A5.根据条款A1-A3任意一项所述的方法,所述方法还包括:
从输入矩阵中确定待加载矩阵;其中,输入矩阵包括左乘矩阵和右乘矩阵,待加载矩阵为左乘矩阵或右乘矩阵;
根据处理元件的排列以及待加载矩阵的行秩以及列秩确定是否对待加载矩阵进行分块;其中,待加载矩阵为左乘矩阵或右乘矩阵;
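If the matrix to be loaded is to be blocked, blocking the matrix to be loaded according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, to obtain two or more first matrices.
Clause A6. The method according to Clause A5, further comprising:
According to the way the matrix to be loaded is blocked, blocking the other input matrix (the one that is not the matrix to be loaded) to obtain two or more second matrices;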
根据第一矩阵和对应的第二矩阵的乘积,按照矩阵乘的规则计算所述左乘矩阵和所述右乘矩阵的乘积。
条款A7.根据条款A5所述的方法,所述处理器包括多组寄存器,所述方法还包括:
在对所述输入矩阵进行分块后,在所述多组寄存器中堆叠存储所述两个以上第一矩阵,每组存储一个第一矩阵。
条款A8.一种处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述处理器用于对第一矩阵和第二矩阵执行矩阵乘法运算,
所述处理器还包括控制器,所述控制器用于将第一矩阵加载到处理元件的寄存器中;
针对第二矩阵的每一行,所述控制器用于将所述每一行中的元素存储到第一矩阵的每一列元素存储的处理元件的寄存器,与第一矩阵的每一列中的元素分别求乘积,计算一列乘积的和得到第一中间结果;或者,针对第二矩阵的每一列,所述控制器用于将所述每一列中的元素存储到第一矩阵的每一行元素存储的处理元件的寄存器,与第一矩阵的每一行中的元素分别求乘积,计算一行乘积的和得到第一中间结果;
所述控制器还用于将第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
条款A9.根据条款A8所述的处理器,第一矩阵为左乘矩阵、第二矩阵为右乘矩阵,
针对第二矩阵中的每一列元素,所述控制器用于将该列元素中的每个元素存储至第一矩阵中对应的一列元素存储的处理元件的寄存器,控制每一个处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,计算每一行元素乘积的和得到第一中间结果,
其中,第一矩阵中与所述每个元素对应的一列元素是指,该元素在所述第二矩阵中的行数与一列元素的列数相同。
条款A10.根据条款A8所述的处理器,第一矩阵为右乘矩阵、第二矩阵为左乘矩阵,
针对第二矩阵中的每一行元素,所述控制器用于将该行元素中的每个元素存储至第一矩阵中对应的一行元素存储的处理元件的寄存器,控制每一个处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,计算每一列元素乘积的和得到第一中间结果,
其中,第一矩阵中与所述每个元素对应的一行元素是指,该元素在所述第二矩阵中的列数与一行元素所在的行数相同。
条款A11.根据条款A8-A10任意一项所述的处理器,所述处理器还用于根据处理元件的排列,从输入矩阵中确定不需要进行分块的矩阵为第一矩阵,输入矩阵中的另一矩阵为第二矩阵,输入矩阵包括左乘矩阵和右乘矩阵。
条款A12.根据条款A8-A10任意一项所述的处理器,所述控制器还用于从输入矩阵中确定待加载矩阵;其中,输入矩阵包括左乘矩阵和右乘矩阵,待加载矩阵为左乘矩阵或右乘矩阵;根据处理元件的排列以及待加载矩阵的行秩以及列秩确定是否对待加载矩阵进行分块;
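if the matrix to be loaded is to be partitioned, the controller is configured to partition the matrix to be loaded, according to the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, to obtain two or more first matrices.
Clause A13. The processor according to clause A12, wherein the controller is further configured to partition, according to the manner in which the matrix to be loaded is partitioned, the other input matrix (the one that is not the matrix to be loaded) to obtain two or more second matrices; and to compute the product of the left-multiplication matrix and the right-multiplication matrix, following the rules of matrix multiplication, from the products of the first matrices and the corresponding second matrices.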
条款A14.根据条款A12所述的处理器,所述处理器包括多组寄存器,在对所述输入矩阵进行分块后,所述控制器还用于在所述多组寄存器中堆叠存储所述两个以上第一矩阵,每组存储一个第一矩阵。
条款A15.一种人工智能芯片,所述芯片包括如条款A8-A14中任意一项所述的处理器。
条款A16.一种电子设备,包括如条款A15所述的人工智能芯片。
在利用人工智能对信息进行处理的过程中,矩阵运算占用比较大的计算量,并且现有的处理器在处理矩阵运算的过程中把矩阵运算拆解成乘法运算和加法运算,需要频繁的从内存中读取数据,运算的效率很低。
为了解决上述技术问题,本公开提供了一种运算方法以及执行该运算方法的处理器。处理器可以包括多个处理元件(两个以上),这些处理元件可以以二维矩阵的形式排列,每个处理元件可以包括至少一个寄存器。
图2-1示出根据本公开一实施例的处理器的示意图。如图2-1所示,多个处理元件PE(Processing Element)以二维矩阵的形式排列,每个处理元件与相邻的处理元件之间连接,每个PE中可以设置有至少一个寄存器(register)(图中未示出)。处理器还可以包括控制器和存储器,其中,控制器和存储器都与多个处理元件连接,且控制器可以连接存储器。所述控制器用于从存储器中加载输入数据到处理元件的寄存器中,并控制处理元件对输入数据进行处理,比如说,存储器中可以存储有第一矩阵和第二矩阵,处理器用于对第一矩阵和第二矩阵执行矩阵乘法运算,因此,控制器可以将第一矩阵和第二矩阵加载到处理元件的寄存器中,并控制处理元件执行矩阵乘法运算。
在一种可能的实现方式中,存储器中还可以存储有可执行程序,可执行程序中可以包括指令,执行指令可以实现对第一矩阵和第二矩阵的矩阵乘法运算。控制器中可以设置有加载器、译码器等,其中,加载器可以用于将存储器中的输入数据加载到处理元件的寄存器中,译码器可以根据加载后输入数据的存储地址对可执行程序中访问数据的指令进行译码,比如说,对于访问数据的指令,通过译码获得数据在寄存器中存储的地址赋值给访问数据的指令,并将译码后的指令发送给处理元件,由处理元件执行指令,从而实现对数据的处理,比如说实现对第一矩阵和第二矩阵的矩阵乘法运算。
在一种可能的实现方式中,存储器可以为片上缓存,控制器可以将片外闪存上的可执行程序以及输入数据(例如,输入矩阵,包括左乘矩阵和右乘矩阵)加载到上述存储器(片上缓存)中,再进行之后的矩阵乘法运算的过程。
在一种可能的实现方式中,控制器也可以直接从片外内存上加载输入矩阵以及可执行程序到处理元件的寄存器中,本公开对此不作限定。
PE中还可以包括运算器以完成指定的运算,以矩阵运算为例,PE中可以包括例如乘法器、加法器等,各个PE中的具体结构可以相同,也可以存在不同,本公开对此不作限定。PE中还可以包括其他类型的运算器,以适应各种不同的运算过程,本公开对PE包括的运算器的数量和类型不作限定。
矩阵乘法运算的输入矩阵可以包括左乘矩阵和右乘矩阵,其中,左乘矩阵可以是指位于乘号左边的矩阵,右乘矩阵可以是指位于乘号右边的矩阵。
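Since the number and arrangement of the PEs in the processor are fixed, before loading data into the registers of the processing elements and performing the computation, the controller may determine whether to partition the input matrices according to the arrangement of the processing elements and the row rank and column rank of the input matrices. The arrangement of the processing elements may refer to the number of rows and columns of the processing elements, and the row rank and column rank of the input matrices may refer to the numbers of rows and columns of the left-multiplication matrix and the right-multiplication matrix.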
控制器根据处理元件的排列以及输入矩阵的行秩以及列秩确定是否对输入矩阵进行分块可以是指:控制器判断输入矩阵或者输入矩阵的转置的行数是否大于处理元件的行数、列数是否大于处理元件的列数,根据判断的结果确定是否对输入矩阵进行分块。
如果输入矩阵中的一个矩阵的行数不大于处理元件的行数、且列数不大于处理元件的列数,而且,输入矩阵中的另一个矩阵的转置的行数不大于处理元件的行数、且列数不大于处理元件的列数,则可以不对输入矩阵进行分块。
如果输入矩阵中的任意一个矩阵的行数大于处理元件的行数、或者列数大于处理元件的列数,或者,输入矩阵中的任意一个矩阵的转置的行数大于处理元件的行数、或者列数大于处理元件的列数,则控制器可以对输入矩阵进行分块。
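For example, suppose the array of processing elements is denoted $PE_{MN}$, meaning that the processing elements form an $M \times N$ matrix with $M$ rows and $N$ columns. Suppose one input matrix is $A_{mn}$, an $m \times n$ matrix with $m$ rows and $n$ columns, and the other input matrix is $B_{nk}$, an $n \times k$ matrix with $n$ rows and $k$ columns. If the number of rows $m$ of $A_{mn}$ is not greater than the number of rows $M$ of the processing elements and its number of columns $n$ is not greater than the number of columns $N$ of the processing elements, and, in addition, the transpose $B_{nk}^{T}$ has a number of rows $k$ not greater than $M$ and a number of columns $n$ not greater than $N$, the input matrices need not be partitioned. Likewise, if the transpose $A_{mn}^{T}$ has a number of rows $n$ not greater than $M$ and a number of columns $m$ not greater than $N$, and $B_{nk}$ has a number of rows $n$ not greater than $M$ and a number of columns $k$ not greater than $N$, the input matrices need not be partitioned.
If the number of rows $m$ of $A_{mn}$ is greater than $M$ or its number of columns $n$ is greater than $N$, or the transpose $B_{nk}^{T}$ has a number of rows $k$ greater than $M$ or a number of columns $n$ greater than $N$, the input matrices may be partitioned; alternatively, if $A_{mn}^{T}$ has a number of rows $n$ greater than $M$ or a number of columns $m$ greater than $N$, or $B_{nk}$ has a number of rows $n$ greater than $M$ or a number of columns $k$ greater than $N$, the input matrices may be partitioned.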
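The size check described above can be sketched as a small predicate. This is an illustrative reading only, not the disclosed controller logic: the function names `fits` and `needs_blocking` are hypothetical, and the sketch assumes that partitioning is unnecessary as soon as either of the two orientations (A with the transpose of B, or the transpose of A with B) fits on the PE array.

```python
def fits(rows, cols, pe_rows, pe_cols):
    """True if a rows x cols matrix fits on a pe_rows x pe_cols PE array."""
    return rows <= pe_rows and cols <= pe_cols


def needs_blocking(a_shape, b_shape, pe_shape):
    """Return True if A (m x n) and B (n x k) should be partitioned.

    Assumption of this sketch: no partitioning is needed when either
    (A, B^T) or (A^T, B) fits on the M x N array of processing elements.
    """
    (m, n), (n2, k) = a_shape, b_shape
    if n != n2:
        raise ValueError("inner dimensions must agree")
    M, N = pe_shape
    # Orientation 1: A as-is (m x n) and B transposed (k x n).
    case1 = fits(m, n, M, N) and fits(k, n, M, N)
    # Orientation 2: A transposed (n x m) and B as-is (n x k).
    case2 = fits(n, m, M, N) and fits(n, k, M, N)
    return not (case1 or case2)
```

For instance, two 3×3 inputs on a 4×4 array need no partitioning, whereas a 2×4 and a 4×3 input on a 2×2 array do.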
若要对输入矩阵中的一个矩阵进行分块,控制器可以根据处理元件的排列对左乘矩阵的行进行拆分或者对右乘矩阵的列进行拆分。
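For example, suppose the array of processing elements is $PE_{22}$, the left-multiplication matrix is $A_{32}$, and the right-multiplication matrix is $B_{22}$; then $A_{32}$ can be split into $A_{12}$ and $A_{22}$, each of which is multiplied by $B_{22}$. If the left-multiplication matrix is $A_{22}$ and the right-multiplication matrix is $B_{23}$, then $B_{23}$ can be split into $B_{21}$ and $B_{22}$.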
若要对输入矩阵中的两个矩阵都进行分块,控制器可以根据处理元件的排列以及输入矩阵的行秩和列秩对左乘矩阵列方向和右乘矩阵行方向以相同的方式进行分块。
也就是说,可以对左乘矩阵和转置后的右乘矩阵在列方向上以相同的方式进行分块,或者将转置后的左乘矩阵和右乘矩阵在行方向上以相同的方式进行分块,其中,所述相同的方式划分指的是划分后所得的第一矩阵和第二矩阵的列数或者行数是相同的,以保证能正常完成矩阵运算。
假设对左乘矩阵分块后可以得到两个以上第一矩阵,对右乘矩阵分块后可以得到两个以上第二矩阵,或者,对右乘矩阵分块后可以得到两个以上第一矩阵,对左乘矩阵分块后可以得到两个以上第二矩阵。
根据处理元件的排列以及输入矩阵的行秩和列秩对左乘矩阵列方向和右乘矩阵行方向以相同的方式进行分块,分块后得到的第一矩阵和第二矩阵都需要满足不需要再进行分块的条件,也就是说,第一矩阵和第二矩阵的转置行数不大于处理元件的行数、且列数不大于处理元件的列数,或者,第一矩阵的转置和第二矩阵的行数不大于处理元件的行数、且列数不大于处理元件的列数。
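In a possible implementation, the controller may partition so that the row rank and column rank of each resulting first matrix or second matrix are as close as possible to the number of rows and columns of the processing elements, which can improve computation efficiency and shorten computation time. That is, assuming the processing elements form a 4×4 array, the matrices may first be partitioned so that the resulting blocks are 4×4, which makes the fullest use of the processing elements and improves computation efficiency.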
举例来说,假设处理元件为2×2的阵列,输入矩阵一个为2×4矩阵、一个为4×3矩阵。划分的方式可以有很多种,图2-2a和图2-2b分别示出了多种不同的划分方式,矩阵A 24在列方向和矩阵B 43在行方向以相同的方式进行分块。图2-2a是划分的一个示例,矩阵A 24在列方向划分为两部分,每一部分包括两列,矩阵B 43在行方向划分为两部分,每一部分包括两行;图2-2b是划分的另一个示例,矩阵A 24在列方向划分为三部分,其中一部分包括两列、另外两部分都包括一列,矩阵B 43在行方向划分为三部分,其中一部分包括两行、另外两部分都包括一行。以上处理元件的排列以及输入矩阵的划分方式仅仅是本公开的一个示例,不以任何方式限制本公开。
对于左乘矩阵的行方向和右乘矩阵的列方向的划分方式,本公开不作具体的限定,只要划分后的矩阵都需要满足不需要再进行分块的条件即可。
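The requirement that the left-multiplication matrix be cut along its columns in the same way as the right-multiplication matrix is cut along its rows can be checked numerically. The sketch below uses NumPy purely as an illustration; the helper name `split_inner` and the sample values are assumptions, and only the cut along the shared dimension is shown (the further row/column cuts needed to fit a 2×2 array are omitted).

```python
import numpy as np


def split_inner(A, B, cuts):
    """Cut A along its columns and B along its rows at the same indices."""
    a_parts = np.split(A, cuts, axis=1)
    b_parts = np.split(B, cuts, axis=0)
    return list(zip(a_parts, b_parts))


A = np.arange(8).reshape(2, 4)     # a 2 x 4 left-multiplication matrix
B = np.arange(12).reshape(4, 3)    # a 4 x 3 right-multiplication matrix

even = split_inner(A, B, [2])      # 2 + 2, in the style of Fig. 2-2a
uneven = split_inner(A, B, [2, 3]) # 2 + 1 + 1, in the style of Fig. 2-2b

# Summing the products of matching pieces reproduces the full product.
product_even = sum(a @ b for a, b in even)
product_uneven = sum(a @ b for a, b in uneven)
```

Because matching pieces always share their inner dimension, any set of cut points works as long as the same cuts are applied to both matrices.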
根据矩阵乘法的运算规则,左乘矩阵的行中的元素与右乘矩阵的列中的元素逐个求乘积、然后求和。因此,在一种可能的实现方式中,对于不分块的情况,或者分块后的第一矩阵和对应的第二矩阵,所述控制器用于将第一矩阵的转置矩阵和第二矩阵的各元素分别加载到各处理元件的寄存器中,转置矩阵和第二矩阵对应位置的元素存储在同一处理元件的寄存器中。按照矩阵乘法规则,转置矩阵和第二矩阵对应位置的元素可以是指转置矩阵中和第二矩阵中需要进行乘法运算的元素。
在一种可能的实现方式中,控制器可以先对第一矩阵进行转置得到转置矩阵,然后将转置矩阵的元素加载到各处理元件的寄存器中,或者,在另一种可能的实现方式中,控制器也可以在加载的过程中实现对第一矩阵的转置,比如说,假设第一矩阵为右乘矩阵,那么控制器在将第一矩阵元素加载到各处理元件的寄存器的过程中,可以将第一矩阵的一列元素加载到一行处理元件的寄存器中实现对第一矩阵的转置。
在一种可能的实现方式中,转置矩阵和第二矩阵在行或者列方向对齐。具体地,如果对左乘矩阵转置,那么,加载后,第一矩阵的转置矩阵的行与第二矩阵在列方向对齐,也就是在列的方向上,转置矩阵和第二矩阵的行对齐;如果对右乘矩阵转置,那么加载后,转置矩阵的列与第二矩阵在行方向对齐,也就是说,在行的方向上,转置矩阵和第二矩阵的列对齐。
在加载完转置矩阵和第二矩阵后,所述控制器还用于控制所述转置矩阵或者第二矩阵中的元素在行方向或者列方向滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积、将同一行或同一列的元素乘积求和得到第一中间结果。具体地,控制器控制处理元件、存储在寄存器内的转置矩阵和第二矩阵重复以下过程,直到转置矩阵或第二矩阵中的元素恢复到未滚动时的位置:控制器控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果,控制存储在寄存器中的转置矩阵或第二矩阵在行方向或列方向滚动一行或一列。
也就是说,先控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果,然后控制转置矩阵或第二矩阵中的元素在行方向或列方向滚动一行或一列,此时可以判断滚动完之后转置矩阵或者第二矩阵中的元素与初始位置是否相同,其中,初始位置可以是指转置矩阵或第二矩阵中的元素未滚动时的位置。若判断结果为相同,那么,结束此过程。若判断结果为不同,那么再控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果,然后控制转置矩阵或第二矩阵中的元 素在行方向或列方向滚动一行或一列,判断滚动完之后转置矩阵或者第二矩阵中的元素与初始位置是否相同……,循环上述过程直到滚动完之后转置矩阵或者第二矩阵中的元素与初始位置相同。
在一个示例中,所述第一矩阵为左乘矩阵、第二矩阵为右乘矩阵。在另一个示例中,所述第一矩阵为右乘矩阵、第二矩阵为左乘矩阵。
在第一矩阵为左乘矩阵、第二矩阵为右乘矩阵时,控制器控制转置矩阵中的元素在行方向上滚动,或者控制第二矩阵的元素在行方向上滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一列的元素乘积求和得到第一中间结果。
在第一矩阵为右乘矩阵、第二矩阵为左乘矩阵时,控制器控制转置矩阵中的元素在列方向上滚动、或者控制第二矩阵中的元素在列方向上滚动;控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行的元素乘积求和得到第一中间结果。
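In a possible implementation, each roll described above moves by one row or one column. A closed loop is formed among the processing elements that store the elements of a matrix. Since adjacent processing elements are connected to each other, the controller can determine how to form the loop according to the dimensions of the matrix. For example, to roll by rows (rolling in the column direction), the first row and the last row of the processing elements storing the matrix elements are connected; when rolling up by one row, the first row of the matrix moves from its original storage position to the position where the last row was stored. To roll by columns (rolling in the row direction), the first column and the last column of the processing elements storing the matrix elements are connected; when rolling left by one column, the first column of the matrix moves from its original storage position to the position where the last column was stored. The connection between processing elements described above may be a virtual connection; that is, there is no physical connection line, and the controller records the corresponding processing elements so that a closed loop is formed during rolling.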
在转置矩阵或第二矩阵中的元素恢复到未滚动时的位置时,完成滚动和计算第一中间结果的过程之后,控制器可以对所述第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
在一种可能的实现方式中,控制器将第一中间结果按行或者按列存储,在行方向或者列方向进行滚动后得到第一矩阵和第二矩阵的乘积。具体的处理方式与进行转置的矩阵和滚动的方向有关,比如说:
在第一矩阵为右乘矩阵、第二矩阵为左乘矩阵时,对转置矩阵在列方向上向上滚动的情况下,可以将第一中间结果按列存储,并将第一中间结果中的元素在行方向上向右滚动;比如,第i行元素在行方向向右滚动i-1步;
在第一矩阵为右乘矩阵、第二矩阵为左乘矩阵时,对转置矩阵在列方向上向下滚动的情况下,可以将第一中间结果按列存储,并将第一中间结果中的元素在行方向上向左滚动;比如,第i行元素在行方向向左滚动i-1步;
在第一矩阵为左乘矩阵、第二矩阵为右乘矩阵时,对转置矩阵在行方向向左滚动的情况下,可以将第一中间结果按行存储,将第一中间结果中第i列元素在列方向向下滚动i-1步得到输入矩阵的乘积;
在第一矩阵为左乘矩阵、第二矩阵为右乘矩阵时,对转置矩阵在行方向向右滚动的情况下,可以将第一中间结果按行存储,将第一中间结果中第i列元素在列方向向上滚动i-1步得到输入矩阵的乘积。
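In the related art, for matrix multiplication with relatively large input matrices, a multi-stage pipeline is usually used to improve the efficiency of the matrix operation. However, because each stage of the pipeline processes only part of the input data, data must be read from memory frequently, and the frequent memory accesses place high demands on bandwidth. To solve this technical problem, the processor provided by the present disclosure can partition the input matrices and store the blocks in a stacked manner, while performing matrix multiplication on the corresponding blocks, which can reduce the memory access frequency and improve computation efficiency.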
若第一矩阵是根据左乘矩阵进行分块得到的,或第二矩阵是根据右乘矩阵分块后得到的,那么,在一种可能的实现方式中,控制器还用于根据第一矩阵和第二矩阵的乘积计算左乘矩阵和右乘矩阵的乘积。也就是说,对于分块后的第一矩阵和对应的第二矩阵分别计算第一矩阵和第二矩阵的乘积,然后根据第一矩阵和第二矩阵的乘积计算左乘矩阵和右乘矩阵的乘积。这样可以降低访存频率,提高运算效率。
在另一种可能的实现方式中,所述处理器包括多组寄存器。也就是说,控制器可以根据对矩阵分块的情况,将处理元件的寄存器分为多个组。
这样,所述控制器可以在对所述输入矩阵进行分块后,将两个以上所述第一矩阵进行转置得到转置矩阵;控制器将转置矩阵、和两个以上所述第二矩阵加载到所述多组寄存器中堆叠存储,一组寄存器中存储有对应位置的转置矩阵和第二矩阵。
在每次对转置矩阵或第二矩阵中的元素在行方向或列方向滚动一次之前,控制器控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果;在控制一组寄存器中的元素在行或列方向上滚动一行或一列转置矩阵之后,控制器还对滚动结果进行修正。
在一种可能的实现方式中,对滚动结果进行修正包括:
若在行方向上向左滚动,则修正的方式为,将滚动之后每一块转置矩阵内最后一列数据滚动到相邻的前一块转置矩阵数据的最后一列;
若在行方向上向右滚动,则修正的方式为,将滚动之后每一块转置矩阵内第一列数据滚动到相邻的后一块转置矩阵数据的第一列;
若在列方向上向上滚动,则修正的方式为,将滚动之后每一块转置矩阵内最后一行数据滚动到相邻的前一块转置矩阵数据的最后一行;
若在列方向上向下滚动,则修正的方式为,将滚动之后每一块转置矩阵内第一行数据滚动到相邻的后一块转置矩阵数据的第一行;
其中,每一块转置矩阵是指对分块之后的每一块矩阵进行转置之后的矩阵。具体的计算和修正过程将在下文的示例中详细介绍。
本公开还提供了一种运算方法,用于实现矩阵乘法运算。
对于不分块的情况,或者分块后的第一矩阵和第二矩阵,图2-3示出根据本公开一实施例的运算方法的流程图。对于不分块的情况,也可以直接把左乘矩阵作为第一矩阵、右乘矩阵作为第二矩阵,或者直接把左乘矩阵作为第二矩阵、右乘矩阵作为第一矩阵,本公开对此不作限定。
如图2-3所示,本公开提供的运算方法可以包括以下步骤:
步骤S2-11,将第一矩阵进行转置得到转置矩阵,将转置矩阵和第二矩阵加载到处理元件的寄存器中,转置矩阵和第二矩阵对应位置的元素存储在同一处理元件的寄存器中。
按照矩阵乘法规则,转置矩阵和第二矩阵对应位置的元素可以是指转置矩阵中和第二矩阵中需要进行乘法运算的元素。
在一种可能的实现方式中,转置矩阵和第二矩阵在行或者列方向对齐。具体地,如果对左乘矩阵转置,那么,加载后,第一矩阵的转置矩阵的行与第二矩阵在列方向对齐,也就是在列的方向上,转置矩阵和第二矩阵的行对齐;如果对右乘矩阵转置,那么加载后,转置矩阵的列与第二矩阵在行方向对齐,也就是说,在行的方向上,转置矩阵和第二矩阵的列对齐。
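Step S2-12: controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing elements to multiply the elements in their corresponding registers to obtain element products, and summing the element products of the same row or the same column to obtain a first intermediate result.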
在一种可能的实现方式中,步骤S2-12具体可以包括,重复以下过程直到转置矩阵或第二矩阵中的元素恢复到未滚动时的位置:控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果;在处理元件的矩阵中对转置矩阵或第二矩阵在行方向或列方向滚动一行或一列。
步骤S2-13,将所述第一中间结果进行处理得到所述第一矩阵和第二矩阵的乘积。
也就是说,对于步骤S2-12和步骤S2-13,先控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果,然后控制转置矩阵或第二矩阵中的元素在行方向或列方向滚动一行或一列,此时可以判断滚动完之后转置矩阵或者第二矩阵中的元素与初始位置是否相同,其中,初始位置可以是指转置矩阵或第二矩阵中的元素未滚动时的位置。若判断结果为相同,那么,结束此过程,继续执行步骤S2-13。若判断结果为不同,那么再控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果,然后控制转置矩阵或第二矩阵中的元素在行方向或列方向滚动一行或一列,判断滚动完之后转置矩阵或者第二矩阵中的元素与初始位置是否相同……,循环上述过程直到滚动完之后转置矩阵或者第二矩阵中的元素与初始位置相同。
在一个示例中,所述第一矩阵为左乘矩阵、第二矩阵为右乘矩阵。在另一个示例中,所述第一矩阵为右乘矩阵、第二矩阵为左乘矩阵。
在第一矩阵为左乘矩阵、第二矩阵为右乘矩阵时,步骤S2-12中控制转置矩阵中的元素在行方向上滚动,或者控制第二矩阵中的元素在行方向上滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一列的元素乘积求和得到第一中间结果。
在第一矩阵为右乘矩阵、第二矩阵为左乘矩阵时,步骤S2-12中,控制转置矩阵中的元素在列方向上滚动、或者控制第二矩阵中的元素在列方向上滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行的元素乘积求和得到第一中间结果。
在一种可能的实现方式中,上述的滚动,每次滚动一行或者一列。
对于步骤S2-13,对第一中间结果进行处理可以是指:将第一中间结果按行或者按列存储,在行方向或者列方向进行滚动后得到第一矩阵和第二矩阵的乘积。具体的处理方式与进行转置的矩阵和滚动的方向有关,比如说:
在第一矩阵为右乘矩阵、第二矩阵为左乘矩阵时,对转置矩阵在列方向上向上滚动的情况下,可以将第一中间结果按列存储,将第一中间结果中的元素在行方向上向右滚动;比如,第i行元素在行方向向右滚动i-1步;
在第一矩阵为右乘矩阵、第二矩阵为左乘矩阵时,对转置矩阵在列方向上向下滚动的情况下,可以将第一中间结果按列存储,将第一中间结果中的元素在行方向上向左滚动;比如,第i行元素在行方向向左滚动i-1步;
在第一矩阵为左乘矩阵、第二矩阵为右乘矩阵时,对转置矩阵在行方向向左滚动的情况下,可以将第一中间结果按行存储,将第一中间结果中第i列元素在列方向向下滚动i-1步得到输入矩阵的乘积;
在第一矩阵为左乘矩阵、第二矩阵为右乘矩阵时,对转置矩阵在行方向向右滚动的情况下,可以将第一中间结果按行存储,将第一中间结果中第i列元素在列方向向上滚动i-1步得到输入矩阵的乘积。
下面将分别以第一矩阵为右乘矩阵、第二矩阵为左乘矩阵,和,第一矩阵为左乘矩阵、第二矩阵为右乘矩阵为例对步骤S2-11-步骤S2-13的过程进行说明。
示例2-1 第一矩阵为右乘矩阵、第二矩阵为左乘矩阵,也就是说,对右乘矩阵进行转置。
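Suppose the first matrix $b_{nk}$ and the second matrix $a_{mn}$ are both 3×3 matrices, and the processing elements form a 4×4 array.
Fig. 2-4 shows a schematic diagram of an array of processing elements according to an embodiment of the present disclosure. The operation method of the present disclosure is described with reference to Fig. 2-4 and Fig. 2-3.
Suppose the first matrix is
$$b_{33} = \begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{bmatrix}$$
and the second matrix is
$$a_{33} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}$$
Then the transposed matrix obtained by transposing the first matrix is
$$b_{33}^{T} = \begin{bmatrix} B_{11} & B_{21} & B_{31} \\ B_{12} & B_{22} & B_{32} \\ B_{13} & B_{23} & B_{33} \end{bmatrix}$$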
将第二矩阵加载到所述处理元件的寄存器中,可以按照第二矩阵的行和列的排列方式加载到所述处理元件的寄存器中,也就是说,第二矩阵中的元素在矩阵中的排列方式和在处理元件的寄存器中的排列方式相同。
在一种可能的实现方式中,第二矩阵中的元素在矩阵中的行列数与加载有该元素的处理元件在处理元件组成的阵列中的行列数相同。
举例来说,在一个示例中,可以将A 11加载到PE 11的寄存器中、A 12加载到PE 12的寄存器中、A 13加载到PE 13的寄存器中、A 21加载到PE 21的寄存器中…A 33加载到PE 33的寄存器中,也就是说,第二矩阵中元素的下标可以与其所处的处理元件的下标完全相同。
在另一个示例中,可以将A 11加载到PE 12的寄存器中、A 12加载到PE 13的寄存器中、A 13加载到PE 14的寄存器中、A 21加载到PE 22的寄存器中…A 33加载到PE 34的寄存器中,也就是说,第二矩阵中的元素在矩阵中的排列方式和在处理元件的寄存器中的排列方式相同。
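It should be noted that the above are only some examples of loading the second matrix and do not limit the present disclosure in any way. Those skilled in the art will understand that it suffices for the elements of the second matrix to be arranged in the registers of the processing elements in the same way as they are arranged in the matrix.
The transposed matrix may be loaded into the registers of the processing elements according to the manner in which the second matrix was loaded; in other words, after loading, the columns of the second matrix are aligned with the columns of the transposed matrix, and the elements at corresponding positions of the transposed matrix and the second matrix are stored in the registers of the same processing element.
For example, suppose $A_{11}$ is loaded into the register of $PE_{11}$, $A_{12}$ into the register of $PE_{12}$, $A_{13}$ into the register of $PE_{13}$, $A_{21}$ into the register of $PE_{21}$, ..., and $A_{33}$ into the register of $PE_{33}$; that is, the subscript of each element of the second matrix may be identical to the subscript of the processing element that holds it. Then $B_{11}$ may be loaded into the register of $PE_{11}$, $B_{21}$ into $PE_{12}$, $B_{31}$ into $PE_{13}$, $B_{12}$ into $PE_{21}$, $B_{22}$ into $PE_{22}$, $B_{32}$ into $PE_{23}$, ..., and $B_{33}$ into $PE_{33}$. That is, the transposed matrix is loaded into the registers of the processing elements in an order that is column-aligned with the second matrix.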
在一种可能的实现方式中,也可以先加载转置矩阵再加载第二矩阵,或者同时加载,本公开对具体加载的方式不作限定,只要保证加载后转置矩阵和第二矩阵在行方向对齐,转置矩阵和第二矩阵对应位置的元素存储在同一处理元件的寄存器中即可。
在一种可能的实现方式中,在加载完输入矩阵之后,对于将右乘矩阵转置的情况,可以在列方向连接存储转置矩阵的第一行元素的处理元件和存储转置矩阵的最后一行元素的处理元件,形成环,在 环内的数据可以进行流动以实现矩阵在列方向上的滚动。如图2-1所示,可以将PE 11与PE 31连接形成环,连接PE 12和PE 32可以形成环,连接PE 13和PE 33可以形成环。这样,当数据在环内进行流动时,如果是向上流动,那么第一行的数据将流动到第三行,第二行的数据将流动到第一行,第三行的数据将流动到第二行;如果是向下流动,那么第一行的数据将流动到第二行,第二行的数据将流动到第三行,第三行的数据将流动到第一行。
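In this implementation, only the transposed matrix may be rolled. Before the transposed matrix is rolled for the first time, the controller may control the processing elements to multiply the elements in their corresponding registers to obtain element products, and sum the element products of the same row to obtain a first intermediate result. Taking the above example, the controller may control $PE_{11}$ to multiply the elements $A_{11}$ and $B_{11}$ stored in its registers to obtain the element product $A_{11} \times B_{11}$; similarly, the controller may control $PE_{12}$ and $PE_{13}$ to obtain $A_{12} \times B_{21}$ and $A_{13} \times B_{31}$.
The controller may then sum the element products located in the same row to obtain $C_{11} = A_{11} \times B_{11} + A_{12} \times B_{21} + A_{13} \times B_{31}$.
$C_{22}$ and $C_{33}$ can be obtained in the same way.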
在一种可能的实现方式中,可以将C 11、C 22和C 33作为第一列第一中间结果暂时存储在缓存器中。该缓存器可以位于处理器中多个处理元件以外的位置。
接下来,在一种可能的实现方式中,可以对转置矩阵向上滚动一行,第一行的元素滚动到(存储有矩阵的元素的处理元件的)最后一行。或者,也可以对转置矩阵向下滚动一行,本公开对具体滚动的方向不作限定,对于本实施方式中的示例在列方向以行为单位进行滚动即可。
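As shown in Fig. 2-1, when rolling upward, the data of the first row can roll to the third row, as follows:
$$\begin{bmatrix} B_{12} & B_{22} & B_{32} \\ B_{13} & B_{23} & B_{33} \\ B_{11} & B_{21} & B_{31} \end{bmatrix}$$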
在一种可能的实现方式中,可以利用处理元件内多余的寄存器或者处理器中的片上缓存实现矩阵中数据的滚动过程。该实施方式适用于本公开的示例2-1和示例2-2中的滚动过程。
举例来说,以上述示例2-1为例,可以先将转置矩阵的第一行元素暂存在多余的寄存器中,控制第二行的处理元件将对应的寄存器存储的转置矩阵的第二行元素发送给第一行的处理元件,然后再控制第三行的处理元件将对应的寄存器存储的转置矩阵的第三行元素发送给第二行处理元件,最后,可以将暂存的第一行元素存储到第三行的处理元件对应的寄存器中,从而实现转置矩阵的一行数据的滚动过程。以上过程仅仅是本公开的一个示例,不以任何方式限制本公开。
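The one-row roll using spare storage, as described in the paragraph above, can be sketched as follows. The function name `roll_up_one_row` and the representation of each PE row as a Python list are assumptions made for illustration; the labels are those of the transposed matrix in Example 2-1.

```python
def roll_up_one_row(rows):
    """Roll a list of register rows up by one, using one spare row of storage."""
    spare = rows[0]                 # park the first row in spare registers
    for r in range(1, len(rows)):
        rows[r - 1] = rows[r]       # row r sends its data to row r - 1
    rows[-1] = spare                # the parked row lands in the last row
    return rows


bt = [["B11", "B21", "B31"],
      ["B12", "B22", "B32"],
      ["B13", "B23", "B33"]]
rolled = roll_up_one_row([row[:] for row in bt])
```

After the call, the former first row sits in the last row, matching the upward roll shown for the transposed matrix.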
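The processing elements are again controlled to multiply the elements in their corresponding registers to obtain element products, and the element products of the same row are summed to obtain first intermediate results: the first row of $a_{33}$ multiplied by the second row of $b_{33}^{T}$ gives $C_{12}$, the second row of $a_{33}$ multiplied by the third row of $b_{33}^{T}$ gives $C_{23}$, and the third row of $a_{33}$ multiplied by the first row of $b_{33}^{T}$ gives $C_{31}$. $C_{12}$, $C_{23}$ and $C_{31}$ are temporarily stored in the buffer as the second column of first intermediate results.
The transposed matrix is rolled upward by one row again, the elements in the corresponding registers are multiplied to obtain element products, and the element products of the same row are summed to obtain the first intermediate results $C_{13}$, $C_{21}$ and $C_{32}$, which are temporarily stored in the buffer as the third column of first intermediate results.
That is, the first intermediate results stored in the buffer are
$$\begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{22} & C_{23} & C_{21} \\ C_{33} & C_{31} & C_{32} \end{bmatrix}$$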
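For step S2-13, in the case where the transposed matrix is rolled upward, processing the first intermediate results means that the controller stores the obtained first intermediate results by column and then rolls the elements of the i-th row of the first intermediate results to the right by i-1 steps in the row direction to obtain the product of the input matrices. The rolling here is likewise a closed-loop roll in the row direction: the first column and the last column of the processing elements storing the matrix elements are connected to form a closed loop. During rolling to the right, the elements stored in the last column of processing elements roll into the first column of processing elements.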
可选地,对于步骤S2-13,对于将转置矩阵向下滚动的情况,所述将第一中间结果进行处理指的是,控制器将得到的第一中间结果按列存储,然后由控制器将第一中间结果中第i行元素在行方向向左滚动i-1步得到输入矩阵的乘积。
本领域技术人员可以理解的是,对于步骤S2-13,还可以由控制器将根据第一中间结果的行列标识将第一中间结果中的元素在行方向(例如,向右滚动或者向左滚动)滚动得到输入矩阵的乘积。在这种实施方式中,存储在寄存器中的元素都可以携带有元素在矩阵中的行列标识,在滚动的过程中,根据元素在矩阵中所处的行列标识确定第一中间结果中元素的行列标识,从而使得控制器可以根据第一中间结果的行列标识对第一中间结果中的元素在行方向进行滚动得到第一矩阵和第二矩阵的乘积。
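Taking the above example, row 1 rolls to the right by 0 steps, that is, it does not roll. Row 2 rolls to the right by 1 step; that is, $C_{21}$ rolls one step to the right into column 1, $C_{23}$ rolls one step to the right into column 3, and $C_{22}$ rolls one step to the right into column 2, giving:
$$\begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{33} & C_{31} & C_{32} \end{bmatrix}$$
Rolling row 3 to the right by 2 steps gives the product of the input matrices:
$$\begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{bmatrix}$$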
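The flow of Example 2-1 (transpose the right-multiplication matrix, roll it upward, sum along rows, then realign the result) can be simulated in a few lines. This is a minimal NumPy sketch for square inputs only, not the hardware procedure itself; the function name is hypothetical, and `np.roll` stands in for the closed-loop data movement between processing elements.

```python
import numpy as np


def matmul_roll_up(a, b):
    """Compute a @ b for square n x n inputs by rolling b's transpose upward."""
    n = a.shape[0]
    bt = b.T.copy()                          # S2-11: transpose of the first matrix
    inter = np.zeros((n, n), dtype=np.result_type(a, b))
    for step in range(n):                    # S2-12: n multiply-and-sum rounds
        # Each PE multiplies its two register values; sum along every PE row
        # and store the sums as one column of first intermediate results.
        inter[:, step] = (a * bt).sum(axis=1)
        bt = np.roll(bt, -1, axis=0)         # roll the transposed matrix up one row
    out = np.empty_like(inter)
    for r in range(n):                       # S2-13: row i (1-based) rolls right i-1 steps
        out[r] = np.roll(inter[r], r)
    return out
```

After n rounds the transposed matrix is back in its starting position, which corresponds to the stop condition described for step S2-12.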
在一种可能的实现方式中,在步骤S2-12中,还可以对第二矩阵在列方向上进行滚动,具体的过程与转置矩阵滚动的过程类似,只不过对于步骤S2-13中处理和滚动元素的方式稍有区别。本公开对具体的推导过程不再赘述,参考以上过程。
需要说明的是,以上示例中的处理元件的排列、输入矩阵等仅仅是为了清楚说明本公开运算方法的过程,不以任何方式限制本公开。
示例2-2 第一矩阵为左乘矩阵、第二矩阵为右乘矩阵,也就是说对左乘矩阵进行转置
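Again suppose the first matrix $a_{mn}$ and the second matrix $b_{nk}$ are both 3×3 matrices, and the processing elements form a 4×4 array.
Suppose the first matrix is
$$a_{33} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}$$
Then the transposed matrix obtained by transposing the first matrix is
$$a_{33}^{T} = \begin{bmatrix} A_{11} & A_{21} & A_{31} \\ A_{12} & A_{22} & A_{32} \\ A_{13} & A_{23} & A_{33} \end{bmatrix}$$
and the second matrix is
$$b_{33} = \begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{bmatrix}$$
The second matrix is loaded into the registers of the processing elements; for the loading manner, refer to the way the second matrix is loaded in Example 2-1, which is not repeated here. The transposed matrix is then loaded into the registers of the processing elements according to the manner in which the second matrix was loaded; after loading, the rows of the transposed matrix of the first matrix are aligned with the rows of the second matrix.
For example, suppose $B_{11}$ is loaded into the register of $PE_{11}$, $B_{12}$ into the register of $PE_{12}$, $B_{13}$ into the register of $PE_{13}$, $B_{21}$ into the register of $PE_{21}$, ..., and $B_{33}$ into the register of $PE_{33}$; that is, the subscript of each element of the second matrix may be identical to the subscript of the processing element that holds it. Then $A_{11}$ may be loaded into the register of $PE_{11}$, $A_{21}$ into $PE_{12}$, $A_{31}$ into $PE_{13}$, $A_{12}$ into $PE_{21}$, $A_{22}$ into $PE_{22}$, $A_{32}$ into $PE_{23}$, ..., and $A_{33}$ into $PE_{33}$. That is, the transposed matrix is loaded into the registers of the processing elements in an order that is row-aligned with the other matrix (the second matrix).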
在一种可能的实现方式中,在加载完输入矩阵之后,对于将第一矩阵转置的情况,可以在行方向连接存储转置矩阵的第一列元素的处理元件和存储转置矩阵的最后一列元素的处理元件,形成环,在环内的数据可以进行流动,从而便于在行的方向上以列为单位进行滚动。如图2-4所示,连接PE 11和PE 13可以形成环,连接PE 21和PE 23可以形成环,连接PE 31和PE 33可以形成环,这样,当数据在环内进行流动时,如果是向左流动,那么第一列的数据将流动到第三列,第二列的数据将流动到第一列,第三列的数据将流动到第二列;如果是向右流动,那么第一列的数据将流动到第二列,第二列的数据将流动到第三列,第三列的数据将流动到第一列。
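In this implementation, only the transposed matrix may be rolled. Before the transposed matrix is rolled to the left or to the right in the row direction for the first time, the controller may control the processing elements to multiply the elements in their corresponding registers to obtain element products, and sum the element products of the same column to obtain a first intermediate result. Taking the above example, $PE_{11}$ multiplies the elements $A_{11}$ and $B_{11}$ stored in its registers to obtain the element product $A_{11} \times B_{11}$; similarly, $A_{12} \times B_{21}$ and $A_{13} \times B_{31}$ can be obtained.
Summing the element products of the first column gives $C_{11} = A_{11} \times B_{11} + A_{12} \times B_{21} + A_{13} \times B_{31}$.
In the same way, the sum of element products of the second column, $C_{22}$, and that of the third column, $C_{33}$, can be obtained.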
在一种可能的实现方式中,可以将C 11、C 22和C 33作为第一行第一中间结果暂时存储在缓存器中。
接下来可以对转置矩阵向左滚动一列,第一列的元素滚动到最后一列,或者也可以向右滚动一列,本公开对此不作限定。
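As shown in Fig. 2-1, when rolling to the left, the data of the first column can roll to the third column, as follows:
$$\begin{bmatrix} A_{21} & A_{31} & A_{11} \\ A_{22} & A_{32} & A_{12} \\ A_{23} & A_{33} & A_{13} \end{bmatrix}$$
The processing elements are again controlled to multiply the elements in their corresponding registers to obtain element products, and the element products of the same column are summed to obtain first intermediate results: the second column of $a_{33}^{T}$ multiplied by the first column of $b_{33}$ gives $C_{21}$, the third column of $a_{33}^{T}$ multiplied by the second column of $b_{33}$ gives $C_{32}$, and the first column of $a_{33}^{T}$ multiplied by the third column of $b_{33}$ gives $C_{13}$. $C_{21}$, $C_{32}$ and $C_{13}$ are temporarily stored in the buffer as the second row of first intermediate results.
The transposed matrix is rolled to the left by one column again, the elements in the corresponding registers are multiplied to obtain element products, and the element products of the same column are summed to obtain the first intermediate results $C_{31}$, $C_{12}$ and $C_{23}$, which are temporarily stored in the buffer as the third row of first intermediate results.
That is, the first intermediate results stored in the buffer are
$$\begin{bmatrix} C_{11} & C_{22} & C_{33} \\ C_{21} & C_{32} & C_{13} \\ C_{31} & C_{12} & C_{23} \end{bmatrix}$$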
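For step S2-13, in the case where the transposed first matrix is rolled to the left, the first intermediate results may be stored by row, and the controller may roll the elements of the i-th column of the first intermediate results downward by i-1 steps in the column direction to obtain the product of the input matrices.
Optionally, in the case where the transposed first matrix is rolled to the right, the controller may store the first intermediate results by row and roll the elements of the i-th column of the first intermediate results upward by i-1 steps in the column direction to obtain the product of the input matrices. The specific steps are similar to those for rolling to the left and are not repeated here.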
本领域技术人员可以理解的是,对于步骤S2-13,还可以由控制器将根据第一中间结果的行列标识将第一中间结果中的元素在列方向(例如,向上移或者向下移)滚动得到输入矩阵的乘积。在这种实施方式中,存储在寄存器中的元素都可以携带有元素在矩阵中的行列标识,在滚动的过程中,根据元素在矩阵中所处的行列标识确定第一中间结果中元素的行列标识,从而使得控制器可以根据第一中间结果的行列标识对第一中间结果中的元素在列方向进行滚动得到输入矩阵的乘积。
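Taking the above example, column 1 rolls downward by 0 steps, that is, it does not roll. Column 2 rolls downward by 1 step; that is, $C_{12}$ rolls down one step into row 1, $C_{32}$ rolls down one step into row 3, and $C_{22}$ rolls down one step into row 2, giving:
$$\begin{bmatrix} C_{11} & C_{12} & C_{33} \\ C_{21} & C_{22} & C_{13} \\ C_{31} & C_{32} & C_{23} \end{bmatrix}$$
Rolling column 3 downward by 2 steps gives the product of the input matrices:
$$\begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{bmatrix}$$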
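Example 2-2 mirrors Example 2-1 with rows and columns exchanged: the left-multiplication matrix is transposed, rolled to the left, and the products are summed along columns. The following NumPy sketch, with a hypothetical function name and limited to square inputs, is meant only to make the index bookkeeping concrete.

```python
import numpy as np


def matmul_roll_left(a, b):
    """Compute a @ b for square n x n inputs by rolling a's transpose leftward."""
    n = a.shape[0]
    at = a.T.copy()                          # S2-11: transpose of the first matrix
    inter = np.zeros((n, n), dtype=np.result_type(a, b))
    for step in range(n):                    # S2-12: n multiply-and-sum rounds
        # Sum the element products of every PE column and store the sums
        # as one row of first intermediate results.
        inter[step, :] = (at * b).sum(axis=0)
        at = np.roll(at, -1, axis=1)         # roll the transposed matrix left one column
    out = np.empty_like(inter)
    for c in range(n):                       # S2-13: column i (1-based) rolls down i-1 steps
        out[:, c] = np.roll(inter[:, c], c)
    return out
```

The second stored row in this sketch corresponds to $(C_{21}, C_{32}, C_{13})$, in line with the worked example.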
需要说明的是,以上示例中的处理元件的排列、输入矩阵等仅仅是为了清楚说明本公开运算方法的过程,不以任何方式限制本公开。
在一种可能的实现方式中,在步骤S2-12中,还可以对第二矩阵在行方向上进行滚动,具体的过程与转置矩阵滚动的过程类似,只不过对于步骤S2-13中处理和滚动元素的方式稍有区别。本公开对具体的推导过程不再赘述,参考以上过程。
根据本公开上述各实施方式的矩阵乘的运算方法,更适用于以阵列排布的处理元件组成的处理器。对于满足处理元件的排列的任意规模的输入矩阵,都可以得到矩阵乘法的运算结果,并且相比于相关技术中的矩阵乘运算可以减少访存次数,降低带宽压力,提高运算的效率。
对于不进行分块的情况,根据上述示例可以直接得到矩阵乘的结果。对于需要进行分块的情况,对于分块后的第一矩阵和第二矩阵,按照矩阵乘的规则将第一矩阵和对应的第二矩阵相乘得到的结果作为第二中间结果,也就是说可以将分块后得到的第一矩阵和第二矩阵作为矩阵的一个元素执行矩阵乘法的运算过程得到第二中间结果,根据第二中间结果进行计算可以得到所述输入矩阵的乘积。
图2-5示出根据本公开一实施例的分块的示意图。如图2-5所示,控制器可以将矩阵D和E按照以上所述的方式进行分块得到第一矩阵D 11、D 12、D 21、D 22,以及第二矩阵E 11、E 12、E 21、E 22。控制器可以将第一矩阵和第二矩阵作为矩阵的一个元素执行矩阵乘法的运算过程,例如,将矩阵D第一行乘以矩阵E第一列为F 11=D 11×E 11+D 12×E 21,将矩阵D第一行乘以矩阵E第二列为F 12=D 11×E 12+D 12×E 22,将矩阵D第二行乘以矩阵E第一列为F 21=D 21×E 11+D 22×E 21,将矩阵D第二行乘以矩阵E第二列为F 22=D 21×E 12+D 22×E 22。也就是说,为了得到最终的矩阵乘法的运算结果,需要先得到第二中间结果:
D 11×E 11,D 12×E 21,D 11×E 12,D 12×E 22
D 21×E 11,D 22×E 21,D 21×E 12,D 22×E 22
得到第二中间结果的过程可以通过将对应的第一矩阵和第二矩阵分别按照步骤S2-11-步骤S2-13的过程进行运算得到。
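Treating each block as a single "element" of an ordinary matrix product can be sketched as below. This is an assumption-laden illustration: the function name `blocked_matmul` is hypothetical, square blocks of a common size are assumed, and the inner product `@` merely stands in for steps S2-11 to S2-13 applied to one pair of blocks.

```python
import numpy as np


def blocked_matmul(D, E, bs):
    """Multiply D and E block by block, using bs x bs blocks."""
    m, n = D.shape
    n2, k = E.shape
    if n != n2 or m % bs or n % bs or k % bs:
        raise ValueError("shapes must agree and be multiples of the block size")
    F = np.zeros((m, k), dtype=np.result_type(D, E))
    for i in range(0, m, bs):
        for j in range(0, k, bs):
            for t in range(0, n, bs):
                # One second intermediate result, e.g. D11 x E11 or D12 x E21,
                # accumulated into the matching block of F.
                F[i:i + bs, j:j + bs] += D[i:i + bs, t:t + bs] @ E[t:t + bs, j:j + bs]
    return F
```

With 4×4 inputs and `bs = 2`, the accumulation for the top-left block is exactly $F_{11} = D_{11} \times E_{11} + D_{12} \times E_{21}$.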
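By partitioning the input matrices and performing the matrix multiplication of the present disclosure on each pair of blocks, the second intermediate results are obtained, from which the product of the input matrices can be computed. With the operation method of the above implementations of the present disclosure, matrix multiplication can be carried out quickly for matrices of any dimension.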
在一个可选地实施例中,所述分块后的第一矩阵和第二矩阵可以分别依次存储在处理元件中进行计算,也还可以堆叠存储在处理元件中。
示例2-3 堆叠存储结合步骤S2-11-步骤S2-13
举例来说,以处理元件为2×2的阵列,输入矩阵都为4×4矩阵为例对本公开的运算方法进行说明。
假设左乘矩阵
Figure PCTCN2021075957-appb-000044
右乘矩阵为
Figure PCTCN2021075957-appb-000045
那么控制器可以将左乘矩阵和右乘矩阵都划分为2×2的矩阵。
图2-6示出根据本公开一实施例的对矩阵划分的示例。如图2-6所示,控制器可以将左乘矩阵和右乘矩阵都划分为2×2的子矩阵,左乘矩阵划分后得到四个矩阵a 11、a 12、a 21、a 22,其中,a 11
Figure PCTCN2021075957-appb-000046
a 12
Figure PCTCN2021075957-appb-000047
a 21
Figure PCTCN2021075957-appb-000048
a 22
Figure PCTCN2021075957-appb-000049
右乘矩阵划分后得到四个矩阵b 11、b 12、b 21、b 22,其中,b 11
Figure PCTCN2021075957-appb-000050
b 12
Figure PCTCN2021075957-appb-000051
b 21
Figure PCTCN2021075957-appb-000052
b 22
Figure PCTCN2021075957-appb-000053
对于进行分块的情况,如果处理元件包含的寄存器的数量可以满足存储输入矩阵的需求,那么还可以采用堆叠存储的方式将输入矩阵存储到处理元件的寄存器中,来实现输入矩阵的乘法运算。在采用堆叠存储的方式存储输入矩阵时,控制器可以把处理元件中的寄存器分为多个不同的组,每组存储一个分块后的第一矩阵和对应的第二矩阵,本公开对具体分组的方式不作限定,但同一组的寄存器中的每一个可以位于不同的处理元件内。
在采用堆叠存储的方式存储输入矩阵的示例中,一种可能的计算方式是,以分块得到的第一矩阵和第二矩阵为单位对矩阵进行滚动,在计算第二中间结果的过程中,采用步骤S2-11-步骤S2-13的过程进行运算。
以采用步骤S2-11-步骤S2-13的过程计算第二中间结果为例,假设以处理元件为2×2的阵列,以图2-6所示的示例为例,对于本公开的运算方法,第一矩阵可以为左乘矩阵分块得到的,也可以是右乘矩阵分块后得到的。
本公开以第一矩阵为右乘矩阵分块得到的为例,加载第二矩阵,将对应的第一矩阵转置后再加载为例对运算方法进行说明,加载的结果如表2-1和表2-2所示。其中,Reg0、Reg1、Reg2和Reg3分别表示处理元件中的一组寄存器,处理元件为2×2的阵列,每个处理器都包括多个寄存器,控制器可以将多个寄存器分为多组,以本实施例为例,可以分为4组,用位于同一组的寄存器存储一个转置矩阵和对应的第二矩阵,如表2-1和表2-2所示,Reg0存储a 11和b 11,Reg1存储a 12和b 21,Reg2存储a 21和b 12,Reg3存储a 22和b 22,也就是说,矩阵
Figure PCTCN2021075957-appb-000054
的第一行元素乘以矩阵
Figure PCTCN2021075957-appb-000055
的第一列元素、以及第二行元素乘以第二列元素。
表2-1 元素存储示例
Figure PCTCN2021075957-appb-000056
Figure PCTCN2021075957-appb-000057
表2-2 元素存储示例
Figure PCTCN2021075957-appb-000058
在计算过程中,对于一组寄存器内的元素,处理元件可以根据步骤S2-11-步骤S2-13的过程计算得到第二中间结果a 11×b 11、a 12×b 21、a 21×b 12以及a 22×b 22。具体过程不再赘述。根据第二中间结果a 11×b 11、a 12×b 21、a 21×b 12以及a 22×b 22可以计算得到C 11=a 11×b 11+a 12×b 21,C 22=a 21×b 12+a 22×b 22
在计算完上述第二中间结果之后,可以以组为单元对转置矩阵进行滚动。具体来说,对于转置矩阵
Figure PCTCN2021075957-appb-000059
向上滚动一行,也就是说,将Reg2中的转置矩阵的元素滚动到Reg0中,Reg0中的转置矩阵的元素滚动到Reg2中,Reg3中的转置矩阵的元素滚动到Reg1中,Reg1中的转置矩阵的元素滚动到Reg3中,由此,可以得到表2-3。
表2-3 元素存储示例
Figure PCTCN2021075957-appb-000060
结合表2-1和表2-3,在计算过程中,对于一组寄存器内的元素,处理元件可以根据步骤S2-11-步骤S2-13的过程计算得到第二中间结果a 11×b 12、a 12×b 22、a 21×b 11以及a 22×b 21。具体过程不再赘述。根据第二中间结果a 11×b 12、a 12×b 22、a 21×b 11以及a 22×b 21可以计算得到C 12=a 11×b 12+a 12×b 22,C 21=a 21×b 11+a 22×b 21
根据以上过程,可以采用分块的方式计算得到输入矩阵的乘积。
因此,根据本公开的矩阵乘的运算方法可以实现任意大小规模的矩阵运算。
示例2-4 堆叠存储结合整体滚动
在另一种可能的实现方式中,还可以采用另一种滚动方式,在本实施例的滚动方式中,图2-3中的步骤S2-12可以通过以下过程实现,在每次对转置矩阵在行方向或列方向滚动一次之前,控制处理 元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行(或者在对第一矩阵转置的示例中,对同一列)的元素乘积求和得到第一中间结果C 11、C 22、C 33、C 44
由于对输入矩阵进行了分块、堆叠存储,原先的一行或者一列数据被存储在不同组的寄存器内,导致原来一行或一列连续存储的数据变成至少两行或至少两列独立的数据存储在不同组的寄存器时,存储在不同组的寄存器中的数据的下一行或下一列的首个数据与上一行或下一列数据的末尾数据在堆叠存放前是连续存放的数据,而在堆叠存放后是不连续存放的,因此,在控制一组寄存器中的元素在行或列方向上滚动一次之后,需要对滚动结果进行修正,才能得到正确的结果。具体修正的方式可以为:
针对每一块转置矩阵,在行或者列方向上滚动一次;
若在行方向上向左滚动,则修正的方式为,将滚动之后每一块内最后一列数据滚动到相邻的前一块数据的最后一列;
若在行方向上向右滚动,则修正的方式为,将滚动之后每一块内第一列数据滚动到相邻的后一块数据的第一列;
若在列方向上向上滚动,则修正的方式为,将滚动之后每一块内最后一行数据滚动到相邻的前一块数据的最后一行;
若在列方向上向下滚动,则修正的方式为,将滚动之后每一块内第一行数据滚动到相邻的后一块数据的第一行。
其中,以上所述的每一块是指每一块转置矩阵,每一块转置矩阵是指对分块之后的每一块矩阵进行转置之后的矩阵。
对于本实施例,对右乘矩阵进行了转置,在滚动过程中还是在行的方向上进行滚动,只不过由于进行了堆叠存储,存在至少两行之间的元素应该是连续的,但是在堆叠存储时被看成了独立的每行,仅仅在每一组的寄存器内的行方向进行滚动无法实现正确的滚动,还需要进行修正。
以表2-2为例,在每一组寄存器内部,向上滚动一行,滚动结果如表2-4所示,在表2-4中,一组寄存器内第一行元素滚动到最后一行。但如表2-2所示,Reg0和Reg1的第一行元素应该滚动到Reg2和Reg3的最后一行、但现在位于Reg0和Reg1的最后一行(如表2-4所示);如表2-2所示,Reg2和Reg3的第一行元素应该滚动到Reg0和Reg1的最后一行、但现在位于Reg2和Reg3的最后一行(如表2-4所示);也就是说,表2-4中现在Reg0和Reg1的最后一行元素应该位于Reg2和Reg3的最后一行,Reg2和Reg3的最后一行元素应该位于Reg0和Reg1的最后一行,那么交换Reg2和Reg0的最后一行元素、以及交换Reg3和Reg1的最后一行元素即可实现滚动的过程,如表2-5所示。
表2-4 元素存储示例
Figure PCTCN2021075957-appb-000061
表2-5 元素存储示例
Figure PCTCN2021075957-appb-000062
Figure PCTCN2021075957-appb-000063
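The upward roll with correction described above (roll inside every register group, then move each block's last row to the last row of the preceding block) can be checked against a plain roll of the unsplit matrix. The sketch below is illustrative: the function name is hypothetical, and the blocks are taken as the vertically stacked pieces of one column strip of the transposed matrix, which is an assumption about how the register groups line up.

```python
import numpy as np


def roll_up_stacked(blocks):
    """Roll vertically stacked blocks up by one row overall.

    blocks[0] is the top block and blocks[-1] the bottom block.
    """
    # 1) Roll every block up by one row inside its own register group.
    rolled = [np.roll(b, -1, axis=0) for b in blocks]
    # 2) Correction: the last row of each block belongs in the last row
    #    of the adjacent previous block (cyclically).
    last_rows = [b[-1].copy() for b in rolled]
    g = len(rolled)
    for i in range(g):
        rolled[(i - 1) % g][-1] = last_rows[i]
    return rolled


full = np.arange(8).reshape(4, 2)             # one 4 x 2 column strip
blocks = [full[:2].copy(), full[2:].copy()]   # two stacked 2 x 2 blocks
corrected = np.vstack(roll_up_stacked(blocks))
```

With two blocks, the correction reduces to swapping the last rows of the two register groups, which is the exchange described for Table 2-4 and Table 2-5.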
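According to Table 2-1 and Table 2-5, the processing elements are controlled to multiply the elements in their corresponding registers to obtain element products, and the element products of the same row are summed to obtain the first intermediate results C 12, C 23, C 34 and C 41.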
重复执行上述过程中的4次计算、3次滚动即可完成矩阵乘的运算过程,根据第一中间结果可以得到输入矩阵的乘积。
在一个可选地实施例中,所述堆叠存储的方式可以根据上文中分块的方式存储,不限于每一个寄存器都存储矩阵中的一个元素,不限于所述矩阵乘的行列数是处理元件行列数的整数倍,也不限于所述堆叠存储的方法是唯一的,在所述修正过程是一样的,只需要满足在修正后原本的一行/列元素能够串联起来即可,具体堆叠存储过程在此不作限制。
需要说明的是,以上堆叠存储、滚动元素的方式仅仅是本公开的一个示例,还可以采用其他的方式实现,本公开对此不作限定。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本公开并不受所描述的动作顺序的限制,因为依据本公开,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本公开所必须的。
进一步需要说明的是,虽然流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
本公开还提供了一种基于处理元件矩阵的矩阵乘的运算装置,该运算装置可以应用于处理器。图2-1所示为处理器的一个示例,处理器可以包括两个以上处理元件,两个以上处理元件以二维矩阵排列,每个处理元件包括至少一个寄存器,所述运算装置用于实现对第一矩阵和第二矩阵的矩阵乘法运算。
应该理解,上述的装置实施例仅是示意性的,本公开的装置还可通过其它的方式实现。例如,上述实施例中所述单元/模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。例如,多个单元、模块或组件可以结合,或者可以集成到另一个系统,或一些特征可以忽略或不执行。
另外,若无特别说明,在本公开各个实施例中的各功能单元/模块可以集成在一个单元/模块中,也可以是各个单元/模块单独物理存在,也可以两个或两个以上单元/模块集成在一起。上述集成的单元/模块既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元/模块如果以硬件的形式实现时,该硬件可以是数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于晶体管,忆阻器等等。若无特别说明,所述寄存器可以是任何适当的磁存储介质或者磁光存储介质,比如,阻变式存储器RRAM(Resistive Random Access Memory)、动 态随机存取存储器DRAM(Dynamic Random Access Memory)、静态随机存取存储器SRAM(Static Random-Access Memory)、增强动态随机存取存储器EDRAM(Enhanced Dynamic Random Access Memory)、高带宽内存HBM(High-Bandwidth Memory)、混合存储立方HMC(Hybrid Memory Cube)等等。
所述集成的单元/模块如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本公开实施例还提出一种计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现上述方法。计算机可读存储介质可以是非易失性计算机可读存储介质。
本公开实施例还提出一种人工智能芯片,所述芯片包括如上所述的处理器。
在一种可能的实现方式中,还公开了一种板卡,其包括存储器件、接口装置和控制器件以及上述人工智能芯片;其中,所述人工智能芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;所述存储器件,用于存储数据;所述接口装置,用于实现所述人工智能芯片与外部设备之间的数据传输;所述控制器件,用于对所述人工智能芯片的状态进行监控。
依据以下条款可更好地理解前述内容:
条款B1.一种处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述处理器用于对第一矩阵和第二矩阵执行矩阵乘法运算,
所述处理器还包括控制器,所述控制器用于将第一矩阵的转置矩阵和第二矩阵的各元素分别加载到各处理元件的寄存器中,所述转置矩阵和所述第二矩阵对应位置的元素存储在同一处理元件的寄存器中;
所述控制器用于控制所述转置矩阵或者第二矩阵在行方向或者列方向滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积、将同一行或同一列的元素乘积求和得到第一中间结果;
所述控制器还用于对所述第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
条款B2.根据条款B1所述的处理器,
控制器控制处理元件、存储在寄存器内的转置矩阵和第二矩阵重复以下过程,直到转置矩阵或第二矩阵中的元素恢复到未滚动时的位置:
所述控制器用于控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果,控制存储在寄存器中的转置矩阵或第二矩阵在行方向或列方向滚动一行或一列。
条款B3.根据条款B1或B2所述的处理器,
在第一矩阵为左乘矩阵、第二矩阵为右乘矩阵时,控制器控制转置矩阵中的元素在行方向上滚动,或者控制第二矩阵中的元素在行方向上滚动;控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一列的元素乘积求和得到第一中间结果;
在第一矩阵为右乘矩阵、第二矩阵为左乘矩阵时,控制器控制转置矩阵中的元素在列方向上滚动、或者控制第二矩阵中的元素在列方向上滚动;控制处理元件对相应的寄存器内的元素进行乘法运算得 到元素乘积,对同一行的元素乘积求和得到第一中间结果。
条款B4.根据条款B1或B2所述的处理器,
所述控制器将第一中间结果按行或者按列存储,在行方向或者列方向进行滚动后得到第一矩阵和第二矩阵的乘积。
条款B5.根据条款B1-B4任意一项所述的处理器,所述控制器还用于根据处理元件的排列以及输入矩阵的行秩以及列秩确定是否对输入矩阵进行分块,其中,输入矩阵包括左乘矩阵和右乘矩阵;
若要对输入矩阵中的一个矩阵进行分块,控制器根据处理元件的排列对左乘矩阵的行进行拆分或者对右乘矩阵的列进行拆分;
若要对输入矩阵中的两个矩阵都进行分块,控制器根据处理元件的排列以及输入矩阵的行秩和列秩对左乘矩阵列方向和右乘矩阵行方向以相同的方式进行分块;
对左乘矩阵分块后得到两个以上所述第一矩阵,对右乘矩阵分块后得到两个以上所述第二矩阵,或者,对左乘矩阵分块后得到两个以上所述第二矩阵,对右乘矩阵分块后得到两个以上所述第一矩阵。
条款B6.根据条款B5所述的处理器,
所述控制器还用于根据第一矩阵和第二矩阵的乘积计算所述左乘矩阵和所述右乘矩阵的乘积。
条款B7.根据条款B5所述的处理器,所述处理器包括多组寄存器,
所述控制器还用于在对所述输入矩阵进行分块后,将两个以上所述第一矩阵进行转置得到转置矩阵;
控制器将转置矩阵、和两个以上所述第二矩阵加载到所述多组寄存器中堆叠存储,一组寄存器中存储有对应位置的转置矩阵和第二矩阵;
在每次对转置矩阵或第二矩阵中的元素在行方向或列方向滚动一次之前,控制器控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果;
在控制一组寄存器中的元素在行或列方向上滚动一行或一列转置矩阵之后,控制器还对滚动结果进行修正。
条款B8.根据条款B7所述的处理器,对滚动结果进行修正包括:
若在行方向上向左滚动,则修正的方式为,将滚动之后每一块转置矩阵内最后一列数据滚动到相邻的前一块转置矩阵数据的最后一列;
若在行方向上向右滚动,则修正的方式为,将滚动之后每一块转置矩阵内第一列数据滚动到相邻的后一块转置矩阵数据的第一列;
若在列方向上向上滚动,则修正的方式为,将滚动之后每一块转置矩阵内最后一行数据滚动到相邻的前一块转置矩阵数据的最后一行;
若在列方向上向下滚动,则修正的方式为,将滚动之后每一块转置矩阵内第一行数据滚动到相邻的后一块转置矩阵数据的第一行;
其中,每一块转置矩阵是指对分块之后的每一块矩阵进行转置之后的矩阵。
条款B9.一种基于处理元件矩阵的矩阵乘的运算方法,应用于处理器,所述处理器包括两个以上处理元件,所述两个以上处理元件以二维矩阵排列,处理元件包括至少一个寄存器,所述方法实现对第一矩阵和第二矩阵的矩阵乘法运算,所述方法包括:
将第一矩阵进行转置得到转置矩阵,将所述转置矩阵和所述第二矩阵的各元素分别加载到各处理元件的寄存器中,所述转置矩阵和所述第二矩阵对应位置的元素存储在同一处理元件的寄存器中;
控制所述转置矩阵或者第二矩阵在行方向或者列方向滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积、将同一行或同一列的元素乘积求和得到第一中间结果;
对所述第一中间结果进行处理得到第一矩阵和第二矩阵的乘积。
条款B10.根据条款B9所述的运算方法,控制所述转置矩阵或者第二矩阵在行方向或者列方向滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积、将同一行或同一列的元素乘积求和得到第一中间结果,包括,重复以下过程直到转置矩阵或第二矩阵中的元素恢复到未滚动时的位置:
控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果,在处理元件的矩阵中对转置矩阵或第二矩阵在行方向或列方向滚动一行或一列。
条款B11.根据条款B9或B10所述的方法,
在第一矩阵为左乘矩阵、第二矩阵为右乘矩阵时,控制转置矩阵中的元素在行方向上滚动,或者控制第二矩阵中的元素在行方向上滚动;控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一列的元素乘积求和得到第一中间结果;
在第一矩阵为右乘矩阵、第二矩阵为左乘矩阵时,控制转置矩阵中的元素在列方向上滚动、或者控制第二矩阵中的元素在列方向上滚动;控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行的元素乘积求和得到第一中间结果。
条款B12.根据条款B9或B10所述的方法,将所述第一中间结果进行处理得到所述第一矩阵和第二矩阵的乘积,包括:
将第一中间结果按行或者按列存储,在行方向或者列方向进行滚动后得到第一矩阵和第二矩阵的乘积。
条款B13.根据条款B9-B12任意一项所述的方法,所述方法还包括:
根据处理元件的排列以及输入矩阵的行秩以及列秩确定是否对输入矩阵进行分块,其中,输入矩阵包括左乘矩阵和右乘矩阵;
若要对输入矩阵中的一个矩阵进行分块,根据处理元件的排列对左乘矩阵的行进行拆分或者对右乘矩阵的列进行拆分;
若要对输入矩阵中的两个矩阵都进行分块,根据处理元件的排列以及输入矩阵的行秩和列秩对左乘矩阵列方向和右乘矩阵行方向以相同的方式进行分块;
对左乘矩阵分块后得到两个以上所述第一矩阵,对右乘矩阵分块后得到两个以上所述第二矩阵,或者,对左乘矩阵分块后得到两个以上所述第二矩阵,对右乘矩阵分块后得到两个以上所述第一矩阵。
条款B14.根据条款B13所述的方法,所述方法还包括:
根据第一矩阵和第二矩阵的乘积计算所述左乘矩阵和所述右乘矩阵的乘积。
条款B15.根据条款B13所述的方法,所述处理器包括多组寄存器,
所述方法还包括:
在对所述输入矩阵进行分块后,将两个以上所述第一矩阵进行转置得到转置矩阵;
在所述多组寄存器中堆叠存储所述转置矩阵、和两个以上所述第二矩阵,一组寄存器中存储有对应位置的转置矩阵和第二矩阵;
在每次对转置矩阵或第二矩阵中的元素在行方向或列方向滚动一次之前,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积,对同一行或者同一列的元素乘积求和得到第一中间结果;
在控制一组寄存器中的元素在行或列方向上滚动一行或一列转置矩阵之后,对滚动结果进行修正。
条款B16.根据条款B15所述的方法,对滚动结果进行修正包括:
若在行方向上向左滚动,则修正的方式为,将滚动之后每一块转置矩阵内最后一列数据滚动到相邻的前一块转置矩阵数据的最后一列;
若在行方向上向右滚动,则修正的方式为,将滚动之后每一块转置矩阵内第一列数据滚动到相邻的后一块转置矩阵数据的第一列;
若在列方向上向上滚动,则修正的方式为,将滚动之后每一块转置矩阵内最后一行数据滚动到相邻的前一块转置矩阵数据的最后一行;
若在列方向上向下滚动,则修正的方式为,将滚动之后每一块转置矩阵内第一行数据滚动到相邻的后一块转置矩阵数据的第一行;
其中,每一块转置矩阵是指对分块之后的每一块矩阵进行转置之后的矩阵。
条款B17.一种人工智能芯片,所述芯片包括如条款B1-B8中任意一项所述的处理器。
条款B18.一种电子设备,包括如条款B17所述的人工智能芯片。
矩阵运算在利用人工智能对信息进行处理的过程中占据比较大的计算量,并且现有的处理器在处理矩阵运算的过程中把矩阵运算拆解成乘法运算和加法运算逐步运算,需要频繁的从内存中读取数据,运算的效率很低。
相关技术中,对于输入矩阵规模比较大的矩阵乘法,为了提高矩阵运算的效率,通常采用多级流水线的方式实现运算的过程,但多级流水线由于每一级对输入数据中的一部分进行处理,因此,需要频繁的从内存中读取数据,频繁访问内存导致对带宽的要求较高。
为了解决上述技术问题,本公开提供了一种运算方法以及执行该运算方法的处理器。处理器可以包括多个处理元件,在一些实施方式中,多个处理元件可以以二维矩阵的形式排列以更好的适应矩阵运算。
图3-1示出根据本公开一实施例的处理器的示意图。如图3-1所示,处理器包括多个处理元件PE(Processing Element)以二维矩阵的形式排列,每个处理元件与相邻的处理元件之间连接,每个PE中可以设置有至少一个寄存器(register)(图中未示出)。在运算过程中,处理器可以将矩阵的元素加载到各个PE对应的寄存器中,处理器可以控制PE可以对PE内设置的寄存器存储的元素进行运算。
处理器还可以包括控制器和存储器,其中,控制器和存储器都与多个处理元件连接,且控制器可以连接存储器。所述控制器用于从存储器中加载输入数据到处理元件的寄存器中,并控制处理元件对输入数据进行处理,比如说,存储器中可以存储有第一矩阵和第二矩阵(或者左乘矩阵和右乘矩阵),处理器用于对第一矩阵和第二矩阵执行矩阵乘法运算,因此,控制器可以将第一矩阵和第二矩阵加载到处理元件的寄存器中,并控制处理元件执行矩阵乘法运算。
在一种可能的实现方式中,存储器中还可以存储有可执行程序,可执行程序中可以包括指令,执行指令可以实现对第一矩阵和第二矩阵的矩阵乘法运算。控制器中可以设置有加载器、译码器等,其中,加载器可以用于将存储器中的输入数据加载到处理元件的寄存器中,译码器可以根据加载后输入数据的存储地址对可执行程序中访问数据的指令进行译码,比如说,对于访问数据的指令,通过译码获得输入数据在寄存器中存储的地址赋值给访问数据的指令,并将译码后的指令发送给处理元件,由处理元件执行指令,从而实现对数据的处理,比如说实现对第一矩阵和第二矩阵的矩阵乘法运算。
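In a possible implementation, the memory may be an on-chip cache. The controller may load the executable program and the input data (for example, the input matrices, including the left-multiplication matrix and the right-multiplication matrix) from off-chip flash memory into the above memory (the on-chip cache), and then carry out the subsequent matrix multiplication.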
在一种可能的实现方式中,控制器也可以直接从片外内存上加载输入矩阵以及可执行程序到处理元件的寄存器中,本公开对此不作限定。
PE中还可以包括运算器以完成指定的运算,以矩阵运算为例,PE中可以包括例如乘法器、加法器等,各个PE中的具体结构可以相同,也可以存在不同,本公开对此不作限定。PE中还可以包括其他类型的运算器,以适应各种不同的运算过程,本公开对PE包括的运算器的数量和类型不作限定。
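In a possible implementation, the processor (controller) may also preprocess the input data to obtain preprocessed input data, load the preprocessed input data into the registers of the processing elements, and control the processing elements to operate on the preprocessed input data.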
乘法操作的输入矩阵可以包括左乘矩阵和右乘矩阵,其中,左乘矩阵可以是指位于乘号左边的矩阵,右乘矩阵可以是指位于乘号右边的矩阵。
由于处理器中PE的数量以及排列方式是已知的,因此,在加载数据并计算之前,控制器可以先根据处理元件的排列以及输入矩阵的行秩以及列秩确定是否对输入矩阵进行分块。对于分块后的每一块矩阵进行运算得到第一中间结果,控制器可以控制处理元件根据第一中间结果计算输入矩阵的乘积。
其中,处理元件的排列可以是指处理元件的行数和列数,输入矩阵的行秩、列秩可以是指左乘矩阵以及右乘矩阵的行数和列数。
根据处理元件的排列以及输入矩阵的行秩以及列秩确定是否对输入矩阵进行分块可以是指:控制器可以判断输入矩阵的行数是否大于处理元件的行数、列数是否大于处理元件的列数,根据判断的结果确定是否对输入矩阵进行分块。
如果输入矩阵中的两个矩阵的行数都不大于处理元件的行数、且列数都不大于处理元件的列数,则控制器可以不对输入矩阵进行分块。
如果输入矩阵中的任意一个矩阵的行数大于处理元件的行数、或者列数大于处理元件的列数,则控制器可以对输入矩阵进行分块。
举例来说,假设处理元件组成的阵列为M×N的矩阵,可以表示为PE MN,假设一个输入矩阵为m×n的矩阵,可以表示为A mn,另一个输入矩阵为n×k的矩阵,可以表示为B nk。如果控制器判断矩阵A mn的行数m不大于处理元件的行数M、且列数n不大于处理元件的列数N,而且,B nk的行数n不大于处理元件的行数M、且列数k不大于处理元件的列数N,则控制器可以不对输入矩阵进行分块。
如果矩阵A mn的行数m大于处理元件的行数M、或者列数n大于处理元件的列数N,或者矩阵B nk的行数n大于处理元件的行数M、或列数k大于处理元件的列数N,则控制器可以对输入矩阵进行分块。
如果要对输入矩阵进行分块,那么假设对左乘矩阵分块后可以得到两个以上第一矩阵,对右乘矩阵分块后可以得到两个以上第二矩阵。
对于分块的情况:若左乘矩阵的列数不大于处理元件的列数、右乘矩阵的行数不大于处理元件的行数,左乘矩阵的行数大于处理元件的行数,则控制器可以确定对输入矩阵中的左乘矩阵进行分块,右乘矩阵的列数大于处理元件的列数,则控制器可以确定对右乘矩阵进行分块;若要对左乘矩阵进行分块,控制器可以根据处理元件的排列对左乘矩阵的行进行拆分,若要对右乘矩阵进行分块,控制器可以根据处理元件的排列对右乘矩阵的列进行拆分。
若输入矩阵中的左乘矩阵的列数大于处理元件的列数、或者右乘矩阵的行数大于处理元件的行数,则控制器可以对输入矩阵中的两个矩阵都进行分块,由于为了使得分块后的矩阵可以进行矩阵乘法运算,只要对左乘矩阵的列进行拆分、就必须对右乘矩阵的行进行拆分,因此不管是左乘矩阵的列数大 于处理元件的列数还是右乘矩阵的行数大于处理元件的行数,控制器都需要对两个矩阵进行分块;若要对输入矩阵中的两个矩阵都进行分块,控制器可以根据处理元件的排列以及输入矩阵的行秩和列秩对左乘矩阵列方向和右乘矩阵行方向以相同的方式进行分块。
举例来说,假设处理元件组成2×2的阵列为PE 22,左乘矩阵为A 32,右乘矩阵为B 22,那么可以将左乘矩阵A 32拆分为矩阵A 12、矩阵A 22分别与右乘矩阵B 22相乘。若左乘矩阵为A 22、右乘矩阵为B 23,那么可以将右乘矩阵B 23拆分为矩阵B 21、矩阵B 22
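The partitioning rules of the preceding paragraphs can be summarised as three independent checks. The sketch below is one possible reading, with a hypothetical function name and dictionary keys; shapes are given as (rows, columns) and the PE array as (M, N).

```python
def blocking_plan(left_shape, right_shape, pe_shape):
    """Report which cuts are needed for left (m x n) times right (n x k)."""
    (m, n), (n2, k) = left_shape, right_shape
    if n != n2:
        raise ValueError("inner dimensions must agree")
    M, N = pe_shape
    return {
        # Left matrix has more rows than the PE array: split its rows.
        "split_left_rows": m > M,
        # Right matrix has more columns than the PE array: split its columns.
        "split_right_cols": k > N,
        # Shared dimension too large on either side: cut the left matrix's
        # columns and the right matrix's rows in the same way.
        "split_inner": n > N or n > M,
    }
```

For a 2×2 array, a 3×2 left matrix only requires a row split, while a 2×4 left matrix with a 4×3 right matrix requires both an inner cut and a column split of the right matrix.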
对于要对输入矩阵中的两个矩阵都进行分块的情况,控制器可以在左乘矩阵的列方向上和右乘矩阵的行方向上以相同的方式进行分块,其中,所述相同的方式划分指的是划分后所得的第一矩阵的列数和对应的第二矩阵的行数是相同的,以保证能正常完成矩阵运算。
根据处理元件的排列以及输入矩阵的行秩和列秩对左乘矩阵列方向和右乘矩阵行方向以相同的方式进行分块,分块后得到的第一矩阵和第二矩阵都需要满足不需要再进行分块的条件,也就是说,第一矩阵和第二矩阵的行数都不大于处理元件的行数、且列数都不大于处理元件的列数。
在一种可能的实现方式中,可以按照划分出的第一矩阵或者第二矩阵的行秩和列秩尽量接近处理元件的行数和列数的方式进行划分,这样可以提高运算的效率,缩短运算时间。也就是说,假设处理元件为4×4的阵列,那么可以先按照划分出的矩阵为4×4的方式进行划分,这样可以最大效率的利用处理元件,提高运算效率。
举例来说,假设处理元件为2×2的阵列,输入矩阵一个为2×4矩阵、一个为4×3矩阵。划分的方式可以有很多种,图3-2a和图3-2b分别示出了多种不同的划分方式,矩阵A 24在列方向和矩阵B 43在行方向以相同的方式进行分块。图3-2a是划分的一个示例,矩阵A 24在列方向划分为两部分,每一部分包括两列,矩阵B 43在行方向划分为两部分,每一部分包括两行,包括图3-2a中(1)和(2)两种情况;图3-2b是划分的另一个示例,矩阵A 24在列方向划分为三部分,其中一部分包括两列、另外两部分都包括一列,矩阵B 43在行方向划分为三部分,其中一部分包括两行、另外两部分都包括一行。以上处理元件的排列以及输入矩阵的划分方式仅仅是本公开的一个示例,不以任何方式限制本公开。
对于左乘矩阵的行方向和右乘矩阵的列方向的划分方式,本公开不作具体的限定,只要划分后的矩阵都满足不需要再进行分块的条件即可。
对于不分块的情况,或者分块后的第一矩阵和第二矩阵,图3-3示出根据本公开一实施例的运算方法的流程图。对于不分块的情况,控制器也可以直接把左乘矩阵作为第一矩阵、右乘矩阵作为第二矩阵。图3-3所示的方法可以由处理器中的控制器执行或者控制器控制处理元件执行,如图3-3所示,本公开提供的运算方法可以包括以下步骤:
步骤S3-31,对第一矩阵和第二矩阵进行预处理得到第三矩阵和第四矩阵,其中,第三矩阵和第四矩阵对应位置的元素存储在同一处理元件的寄存器中。
其中,第三矩阵和第四矩阵都为p×p矩阵,p=max(m,k,n),m表示第一矩阵的行秩,n表示第二矩阵的列秩,第一矩阵的列秩和第二矩阵的行秩为k,max(m,k,n)表示取m、k、n中的最大值;
步骤S3-32,对第三矩阵和第四矩阵在行方向或者列方向进行滚动,控制处理元件对相应的寄存器内的元素进行乘法运算得到元素乘积矩阵;
步骤S3-33,根据对第一矩阵和第二矩阵预处理的方式对元素乘积矩阵进行处理得到第一矩阵和第二矩阵的乘积。
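For the preprocessing in step S3-31, different rolling manners in step S3-32 correspond to different preprocessing manners. The preprocessing may include a first preprocessing and a second preprocessing: the first preprocessing may refer to expanding the first matrix and the second matrix, and the second preprocessing may refer to rolling the elements of the expanded matrices.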
对于第一预处理的过程,控制器可以采用0对第一矩阵和第二矩阵进行扩充,具体地,假设第一矩阵为m×k、第二矩阵为k×n,控制器可以确定m、k、n三者中的最大值p,然后用0在第一矩阵和第二矩阵的下侧和/或右侧扩充形成p×p矩阵。
对于第二预处理的过程,在步骤S3-32中采用不同的滚动方式所对应的第二预处理的过程也是不同的。在一种可能的实现方式中,步骤S3-32可以包括以下过程:
步骤S3-321,控制器控制处理元件对相应的寄存器内的元素进行乘法运算得到第一元素乘积矩阵;
步骤S3-322,控制器重复(p-1)次以下过程:将第三矩阵整体向左滚动一步、将第四矩阵整体向上滚动一步,或者,将第三矩阵整体向右滚动一步、将第四矩阵整体向下滚动一步,控制处理元件对相应的寄存器内的元素进行乘法运算得到第二元素乘积矩阵。
也就是说,控制器在对第三矩阵和第四矩阵进行滚动之前,可以控制处理元件对相应的寄存器内的元素进行乘法运算得到第一元素乘积矩阵。之后,控制器可重复以下过程p-1次:将第三矩阵整体向左滚动一步、将第四矩阵整体向上滚动一步,控制处理元件对相应的寄存器内的元素进行乘法运算得到第二元素乘积矩阵;或者重复以下过程p-1次:将第三矩阵整体向右滚动一步、将第四矩阵整体向下滚动一步,控制处理元件对相应的寄存器内的元素进行乘法运算得到第二元素乘积矩阵。也就是说,执行完步骤S3-322后,控制器可以控制处理元件计算得到p-1个第二元素乘积矩阵。
对于步骤S3-322中每次将第三矩阵整体向左滚动一步、将第四矩阵整体向上滚动一步的过程,对应的第二预处理的过程可以为“将扩充后的第一矩阵的第i行向左滚动i步,将扩充后的第二矩阵的第j列向上滚动j步,其中i、j为自然数,且0≤i≤p-1,0≤j≤p-1”,而对于步骤S3-322中每次将第三矩阵整体向右滚动一步、将第四矩阵整体向下滚动一步的过程,对应的第二预处理的过程可以为“将扩充后的第一矩阵的第i行向左滚动i步、再整体向右滚动1步,将扩充后的第二矩阵的第j列向上滚动j步、再整体向下滚动1步”,或者说“将扩充后的第一矩阵的第i行向左滚动i-1步,将扩充后的第二矩阵的第j列向上滚动j-1步”。
在一种可能的实现方式中,可以在存储有矩阵的元素的处理元件之间形成闭环,由于相邻的处理元件之间是连接在一起的,因此控制器可以根据矩阵的维度确定成环的方式,比如说,如果要在列方向滚动,那么,存储有矩阵的元素的第一行处理元件和最后一行处理元件连接起来,在滚动的过程中,如果向上滚动,那么矩阵的第一行元素从原来存储的位置滚动到最后一行元素存储的位置。若要在行方向上滚动,那么,存储有矩阵的元素的第一列处理元件和最后一列处理元件连接起来,在滚动的过程中,如果向左滚动,那么矩阵的第一列元素从原来存储的位置滚动到最后一列元素存储的位置。上述的处理元件与处理元件的连接可以是指虚拟的连接,也就是说,并没有实际的连接线路,而是控制器记录了对应的处理器,在滚动的过程中形成闭环即可。
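In one possible implementation, the preprocessing of the first matrix and the second matrix may also include a loading process, which may be performed either before or after the first and second preprocessing. That is, in embodiments of the present disclosure, the first matrix and the second matrix may first be loaded into the registers of the processing elements and then subjected to the first and second preprocessing to obtain the third matrix and the fourth matrix; alternatively, the first and second preprocessing may be completed outside the controller to obtain the third matrix and the fourth matrix, which are then loaded into the registers of the processing elements. The present disclosure does not limit this.
It should be noted that the rolling and computation in steps S3-321 and S3-322 above, and the corresponding preprocessing, are only one example of the present disclosure, and the present disclosure is not limited thereto.
In one possible implementation, step S3-33 may include: summing the first element-product matrix and the multiple second element-product matrices to obtain a fifth matrix, and processing the fifth matrix according to the way the first matrix and the second matrix were preprocessed to obtain the matrix product.
The processing of the fifth matrix in step S3-33 may follow the first preprocessing. For example, if element 0 was added on the right and bottom of the first matrix and the second matrix to form p×p matrices, the post-processing of the fifth matrix may be to un-pad it on the right and bottom, for example by removing the 0 elements on the right and bottom of the fifth matrix to form an m×n matrix.
With the matrix multiplication method of the above embodiments of the present disclosure, the multiplication does not need to be decomposed and data does not need to be read repeatedly, which reduces the number of memory reads, lowers bandwidth pressure, and improves efficiency. Moreover, an input matrix of any size can be transformed by preprocessing and then processed to obtain the result of the matrix multiplication.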
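Steps S3-31 to S3-33 can be put together in a software sketch. It resembles the classic Cannon-style skew-and-shift scheme, but it is only a sequential model under stated assumptions: the name `rolling_matmul`, the `direction` labels, and the sample values are hypothetical, and the per-position loop stands in for what the processing elements would do in parallel.

```python
def rolling_matmul(first, second, direction="left_up"):
    """Multiply an m x k matrix by a k x n matrix using pad, skew, roll and multiply."""
    m, k, n = len(first), len(first[0]), len(second[0])
    p = max(m, k, n)

    # First preprocessing: zero-pad both operands to p x p.
    a = [list(r) + [0] * (p - len(r)) for r in first] + [[0] * p for _ in range(p - m)]
    b = [list(r) + [0] * (p - len(r)) for r in second] + [[0] * p for _ in range(p - k)]

    # Second preprocessing: skew rows of a and columns of b.
    shift = 0 if direction == "left_up" else -1
    step = 1 if direction == "left_up" else -1
    third = [[a[i][(j + i + shift) % p] for j in range(p)] for i in range(p)]
    fourth = [[b[(i + j + shift) % p][j] for j in range(p)] for i in range(p)]

    fifth = [[0] * p for _ in range(p)]
    for t in range(p):                       # one multiply, then p-1 roll-and-multiply rounds
        if t > 0:
            third = [[row[(j + step) % p] for j in range(p)] for row in third]
            fourth = [[fourth[(i + step) % p][j] for j in range(p)] for i in range(p)]
        for i in range(p):
            for j in range(p):
                fifth[i][j] += third[i][j] * fourth[i][j]

    # Post-processing: remove the padding to get the m x n product.
    return [row[:n] for row in fifth[:m]]


A = [[1, 2], [3, 4]]
B = [[5, 6, 7], [8, 9, 10]]
product_left_up = rolling_matmul(A, B, "left_up")
product_right_down = rolling_matmul(A, B, "right_down")
```

Both rolling directions give the same 2×3 product for these sample inputs, because in each round every position holds a matching pair A[i][k], B[k][j] and the rounds together cover every k exactly once.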
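Application example
For example, suppose the first matrix is
$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$
and the second matrix is
$$\begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \end{bmatrix}$$
Since the first matrix is 2×2 and the second matrix is 2×3, that is, m=2, k=2, n=3, p is the maximum value 3.
For step S3-31, the first matrix and the second matrix may first be loaded into the registers of the processing elements, and then the first preprocessing is performed: the first matrix is padded to
$$a_{33} = \begin{bmatrix} A_{11} & A_{12} & 0 \\ A_{21} & A_{22} & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
and the second matrix is padded to
$$b_{33} = \begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ 0 & 0 & 0 \end{bmatrix}$$
In one possible implementation, during loading the first-row, first-column elements of the first matrix and the second matrix may be loaded into the registers of the same processing element. For example, the first matrix may be loaded into a first group of registers Reg0 of the processing elements, and the second matrix into a second group of registers Reg1. Each box in Reg0 represents a register in a different processing element, and likewise for Reg1. A11 and B11 are stored in registers of the same processing element. The first or second group of registers here may be one layer of registers physically divided into different layers, or a logically divided group of registers; the present disclosure does not limit this.
Figure PCTCN2021075957-appb-000068
Figure PCTCN2021075957-appb-000069
The controller may also connect processing elements in the row direction or the column direction to form closed loops. For example, in the column direction it may connect the processing elements that store the first row and the last row of the padded first and second matrices to form rings, and data can flow within a ring to roll the matrix in the column direction. Alternatively, in the row direction it may connect the processing elements that store the first column and the last column of the padded first and second matrices to form rings, and data can flow within a ring to roll the matrix in the row direction.
In the above example, PE11 may be connected with PE31, PE12 with PE32, and PE13 with PE33, each forming a closed loop. When data flows within these rings, an upward flow moves the first row to the third row, the second row to the first row, and the third row to the second row; a downward flow moves the first row to the second row, the second row to the third row, and the third row to the first row.
PE11 may also be connected with PE13, PE21 with PE23, and PE31 with PE33, each forming a closed loop. When data flows within these rings, a leftward flow moves the first column to the third column, the second column to the first column, and the third column to the second column; a rightward flow moves the first column to the second column, the second column to the third column, and the third column to the first column.
Second preprocessing: in one example (Example 3-1), for matrix a33 the controller does not roll row 0, rolls the elements of row 1 one step to the left, and rolls the elements of row 2 two steps to the left, giving the following third matrix:
$$\begin{bmatrix} A_{11} & A_{12} & 0 \\ A_{22} & 0 & A_{21} \\ 0 & 0 & 0 \end{bmatrix}$$
For matrix b33, the controller does not roll column 0, rolls the elements of column 1 one step up, and rolls the elements of column 2 two steps up, giving the following fourth matrix:
$$\begin{bmatrix} B_{11} & B_{22} & 0 \\ B_{21} & 0 & B_{13} \\ 0 & B_{12} & B_{23} \end{bmatrix}$$
Second preprocessing: in another example (Example 3-2), for matrix a33 the controller does not roll row 0, rolls the elements of row 1 one step to the left and the elements of row 2 two steps to the left, and then rolls all elements of the matrix one step to the right (in other words, the controller rolls row 0 one step to the right, leaves row 1 unrolled, and rolls row 2 one step to the left), giving the following third matrix:
$$\begin{bmatrix} 0 & A_{11} & A_{12} \\ A_{21} & A_{22} & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
For matrix b33, the controller does not roll column 0, rolls the elements of column 1 one step up and the elements of column 2 two steps up, and then rolls the whole matrix one step down, giving the following fourth matrix:
$$\begin{bmatrix} 0 & B_{12} & B_{23} \\ B_{11} & B_{22} & 0 \\ B_{21} & 0 & B_{13} \end{bmatrix}$$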
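In one possible implementation, the third matrix and the fourth matrix may instead be loaded into the registers of the processing elements after the preprocessing of the first matrix and the second matrix has been completed. It is sufficient to load the elements at corresponding positions of the third matrix and the fourth matrix into the registers of the same processing element; no transposition of the third or fourth matrix is needed. In other words, the third matrix and the fourth matrix are loaded into the registers of the processing elements with rows and columns aligned.
For example, the third matrix may be loaded into the first group of registers Reg0 of the processing elements and the fourth matrix into the second group of registers Reg1. Each box in Reg0 represents a register in a different processing element, and likewise for Reg1. As shown in Figure 3-1, and using the third and fourth matrices obtained by the preprocessing in Example 3-1 above, elements A11 and B11 may be stored in registers of processing element PE11, elements A12 and B22 in registers of processing element PE12, elements A21 and B13 in registers of processing element PE23, and so on. The first or second group of registers here may be one layer of registers physically divided into different layers, or a logically divided group of registers; the present disclosure does not limit this.
It should be noted that this embodiment is only one example of the present disclosure and does not limit it in any way, as long as the third matrix and the fourth matrix are loaded into the registers of the processing elements with rows and columns aligned.
Figure PCTCN2021075957-appb-000074
Figure PCTCN2021075957-appb-000075
The processing elements are controlled to multiply the elements in the corresponding registers to obtain the first element-product matrix, which may be as follows:
$$\begin{bmatrix} A_{11}B_{11} & A_{12}B_{22} & 0 \\ A_{22}B_{21} & 0 & A_{21}B_{13} \\ 0 & 0 & 0 \end{bmatrix}$$
For step S3-32, still taking Example 3-1, rolling the whole third matrix one step to the left gives
$$\begin{bmatrix} A_{12} & 0 & A_{11} \\ 0 & A_{21} & A_{22} \\ 0 & 0 & 0 \end{bmatrix}$$
and rolling the whole fourth matrix one step up gives
$$\begin{bmatrix} B_{21} & 0 & B_{13} \\ 0 & B_{12} & B_{23} \\ B_{11} & B_{22} & 0 \end{bmatrix}$$
The processing elements are controlled to multiply the elements in the corresponding registers to obtain a second element-product matrix, which may be as follows:
$$\begin{bmatrix} A_{12}B_{21} & 0 & A_{11}B_{13} \\ 0 & A_{21}B_{12} & A_{22}B_{23} \\ 0 & 0 & 0 \end{bmatrix}$$
Since p is 3 and p-1 is 2, the whole third matrix must be rolled one more step to the left and the whole fourth matrix one more step up:
$$\begin{bmatrix} 0 & A_{11} & A_{12} \\ A_{21} & A_{22} & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
$$\begin{bmatrix} 0 & B_{12} & B_{23} \\ B_{11} & B_{22} & 0 \\ B_{21} & 0 & B_{13} \end{bmatrix}$$
The processing elements are controlled to multiply the elements in the corresponding registers to obtain another second element-product matrix:
$$\begin{bmatrix} 0 & A_{11}B_{12} & A_{12}B_{23} \\ A_{21}B_{11} & A_{22}B_{22} & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
For step S3-33, the first element-product matrix and the multiple second element-product matrices are summed to obtain the fifth matrix:
$$\begin{bmatrix} A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22} & A_{11}B_{13}+A_{12}B_{23} \\ A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22} & A_{21}B_{13}+A_{22}B_{23} \\ 0 & 0 & 0 \end{bmatrix}$$
Un-padding the fifth matrix (removing the 0 elements at the bottom) gives the matrix product:
$$\begin{bmatrix} A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22} & A_{11}B_{13}+A_{12}B_{23} \\ A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22} & A_{21}B_{13}+A_{22}B_{23} \end{bmatrix}$$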
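The walkthrough of Example 3-1 can be replayed symbolically. In the sketch below each element is a string such as "A11" and a product is the concatenation "A11B11"; the helper names are hypothetical, and the starting matrices are the third and fourth matrices implied by the element-product matrices listed in this example.

```python
def multiply(a, b):
    """Symbolic product: 0 if either operand is 0, else the two names joined."""
    return 0 if a == 0 or b == 0 else a + b


def roll_left(mat):
    return [row[1:] + row[:1] for row in mat]


def roll_up(mat):
    return mat[1:] + mat[:1]


third = [["A11", "A12", 0], ["A22", 0, "A21"], [0, 0, 0]]
fourth = [["B11", "B22", 0], ["B21", 0, "B13"], [0, "B12", "B23"]]
p = 3

products = []                      # first, then the p-1 second element-product matrices
for step in range(p):
    if step > 0:
        third, fourth = roll_left(third), roll_up(fourth)
    products.append([[multiply(a, b) for a, b in zip(row_a, row_b)]
                     for row_a, row_b in zip(third, fourth)])


def sum_terms(i, j):
    """Collect the non-zero terms produced at position (i, j) over all rounds."""
    terms = sorted(prod[i][j] for prod in products if prod[i][j] != 0)
    return "+".join(terms) if terms else 0


fifth = [[sum_terms(i, j) for j in range(p)] for i in range(p)]
result = fifth[:2]                 # un-pad: drop the all-zero bottom row (m = 2, n = 3)
```

Each entry of `result` contains exactly the two terms expected of a 2×2 by 2×3 product, and the third row of `fifth` stays 0 throughout.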
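In one possible implementation, the first element-product matrix and the multiple second element-product matrices computed in the above process may be held temporarily in a temporary buffer. Alternatively, they may be stored in registers of the processing elements, for example in Reg2, Reg3, and Reg4 (other register groups of the processing elements), and each processing element may add up the elements stored in its own registers to realize the summation of the first element-product matrix and the multiple second element-product matrices. It should be noted that these are only some examples of computing the fifth matrix and do not limit the present disclosure in any way.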
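The register-based option can be pictured with a toy model of a single processing element. The class and attribute names are hypothetical, and the operand pairs are illustrative values standing in for (A11, B11), (A12, B21), and (0, 0), which are the pairs PE11 sees over the three rounds of Example 3-1.

```python
class ProcessingElement:
    """Toy model of one PE: Reg0 and Reg1 hold operands, further registers hold products."""

    def __init__(self):
        self.reg0 = 0
        self.reg1 = 0
        self.product_regs = []     # stands in for Reg2, Reg3, Reg4, ...

    def load(self, a, b):
        self.reg0, self.reg1 = a, b

    def multiply(self):
        # Store this round's element product in the next free register.
        self.product_regs.append(self.reg0 * self.reg1)

    def accumulate(self):
        # Local summation: one entry of the fifth matrix.
        return sum(self.product_regs)


pe11 = ProcessingElement()
for a, b in [(1, 5), (2, 8), (0, 0)]:   # operands arriving in rounds 0, 1, 2
    pe11.load(a, b)
    pe11.multiply()

total = pe11.accumulate()
```

Because every processing element sums only its own registers, no data has to move between elements for the final addition.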
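The matrix multiplication method of the above embodiments of the present disclosure is better suited to processors composed of processing elements arranged in an array, and is efficient. For an input matrix of any size that fits the arrangement of the processing elements, the input matrix can be transformed by preprocessing and then processed to obtain the result of the matrix multiplication. Compared with matrix multiplication in the related art, it can also reduce the number of memory accesses, lower bandwidth pressure, and improve efficiency.
When no partitioning is needed, the result of the matrix multiplication is obtained directly as in the above example. When partitioning is needed, each first matrix obtained by partitioning is multiplied by the corresponding second matrix according to the rules of matrix multiplication, and the result is taken as a first intermediate result. In other words, the first and second matrices obtained by partitioning can each be treated as one element of a matrix when carrying out the matrix multiplication to obtain the first intermediate results, and the product of the input matrices can then be computed from the first intermediate results.
Figure 3-4 shows a schematic diagram of partitioning according to an embodiment of the present disclosure. As shown in Figure 3-4, matrices D and E are partitioned as described above to obtain first matrices D11, D12, D21, D22 and second matrices E11, E12, E21, E22. The first and second matrices can each be treated as one element of a matrix when carrying out the matrix multiplication. For example, the first row of D times the first column of E is F11=D11×E11+D12×E21, the first row of D times the second column of E is F12=D11×E12+D12×E22, the second row of D times the first column of E is F21=D21×E11+D22×E21, and the second row of D times the second column of E is F22=D21×E12+D22×E22. That is, to obtain the final result of the matrix multiplication, the following first intermediate results must be obtained first:
D11×E11, D12×E21, D11×E12, D12×E22
D21×E11, D22×E21, D21×E12, D22×E22
Each first intermediate result can be obtained by applying steps S3-31 to S3-33 to the corresponding first matrix and second matrix.
By partitioning the input matrices and applying the matrix multiplication of the present disclosure to each pair of blocks to obtain the first intermediate results, the product of the input matrices can be computed from the first intermediate results. With the method of the above embodiments of the present disclosure, matrix multiplication can be carried out quickly for matrices of any dimension. Compared with the related art, in which the operation is realized through a multi-stage pipeline, it can also reduce the number of memory accesses, lower bandwidth pressure, and improve efficiency.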
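The block formulas F11 to F22 can be checked numerically with a short sketch. The matrices D and E below are arbitrary illustrative 4×4 values, and the helpers `matmul`, `matadd`, and `block` are hypothetical; a plain triple-loop product stands in for the per-block computation.

```python
def matmul(x, y):
    """Reference product of two matrices given as lists of rows."""
    return [[sum(x[i][t] * y[t][j] for t in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def matadd(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def block(mat, r0, r1, c0, c1):
    """Rows r0..r1-1 and columns c0..c1-1 of mat."""
    return [row[c0:c1] for row in mat[r0:r1]]


D = [[4 * i + j for j in range(4)] for i in range(4)]
E = [[(i + 2 * j) % 5 for j in range(4)] for i in range(4)]

D11, D12 = block(D, 0, 2, 0, 2), block(D, 0, 2, 2, 4)
D21, D22 = block(D, 2, 4, 0, 2), block(D, 2, 4, 2, 4)
E11, E12 = block(E, 0, 2, 0, 2), block(E, 0, 2, 2, 4)
E21, E22 = block(E, 2, 4, 0, 2), block(E, 2, 4, 2, 4)

# Eight first intermediate results, combined pairwise.
F11 = matadd(matmul(D11, E11), matmul(D12, E21))
F12 = matadd(matmul(D11, E12), matmul(D12, E22))
F21 = matadd(matmul(D21, E11), matmul(D22, E21))
F22 = matadd(matmul(D21, E12), matmul(D22, E22))

# Stitch the four result blocks back into one 4 x 4 matrix.
F = [r1 + r2 for r1, r2 in zip(F11, F12)] + [r1 + r2 for r1, r2 in zip(F21, F22)]
```

The stitched matrix F equals the direct product of D and E, which is why the blocks can be treated as single elements.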
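Example 3-3
Suppose the processing elements form a 2×2 array. Taking partitioning scheme (1) in Figure 3-2a as an example, the following describes how the first intermediate results are computed block by block and how the product of the input matrices is computed from them.
a11
Figure PCTCN2021075957-appb-000080
a12
Figure PCTCN2021075957-appb-000081
b11
Figure PCTCN2021075957-appb-000082
b21
Figure PCTCN2021075957-appb-000083
b12
Figure PCTCN2021075957-appb-000084
b22
Figure PCTCN2021075957-appb-000085
Then a11×b11 is computed according to steps S3-31 to S3-33 as follows.
For step S3-31, since matrix a11 and matrix b11 are both 2×2 matrices, no padding is needed. The second preprocessing may be as follows. For matrix a11, the controller does not roll row 0 and rolls the elements of row 1 one step to the left, giving the following third matrix:
$$\begin{bmatrix} A_{11} & A_{12} \\ A_{22} & A_{21} \end{bmatrix}$$
For b11, the controller does not roll column 0 and rolls the elements of column 1 one step up, giving the following fourth matrix:
$$\begin{bmatrix} B_{11} & B_{22} \\ B_{21} & B_{12} \end{bmatrix}$$
The elements at corresponding positions of the third matrix and the fourth matrix are stored in registers of the same processing element. For example, the third matrix is stored in the first group of registers Reg0 of the processing elements and the fourth matrix in the second group of registers Reg1. Elements A11 and B11 may be stored in registers of processing element PE11, elements A12 and B22 in registers of processing element PE12, and elements A22 and B21 in registers of processing element PE21.
Figure PCTCN2021075957-appb-000088
Figure PCTCN2021075957-appb-000089
The processing elements are controlled to multiply the elements in the corresponding registers to obtain the first element-product matrix, which may be as follows:
$$\begin{bmatrix} A_{11}B_{11} & A_{12}B_{22} \\ A_{22}B_{21} & A_{21}B_{12} \end{bmatrix}$$
For step S3-32, using the same rolling scheme as in Example 3-1, rolling the whole third matrix one step to the left gives
$$\begin{bmatrix} A_{12} & A_{11} \\ A_{21} & A_{22} \end{bmatrix}$$
and rolling the whole fourth matrix one step up gives
$$\begin{bmatrix} B_{21} & B_{12} \\ B_{11} & B_{22} \end{bmatrix}$$
The processing elements are controlled to multiply the elements in the corresponding registers to obtain the second element-product matrix, which may be as follows:
$$\begin{bmatrix} A_{12}B_{21} & A_{11}B_{12} \\ A_{21}B_{11} & A_{22}B_{22} \end{bmatrix}$$
Since p is 2 and p-1 is 1, the rolling can stop here.
For step S3-33, the first element-product matrix and the second element-product matrix are summed to obtain the fifth matrix:
$$\begin{bmatrix} A_{11}B_{11}+A_{12}B_{21} & A_{12}B_{22}+A_{11}B_{12} \\ A_{22}B_{21}+A_{21}B_{11} & A_{21}B_{12}+A_{22}B_{22} \end{bmatrix}$$
Since the first matrix and the second matrix were not padded, no un-padding is needed either, so the above result is the first intermediate result for a11×b11.
The first intermediate results for a12×b21, a11×b12, and a12×b22 can all be obtained by the process of steps S3-31 to S3-33, and the product of the input matrices is then computed from the first intermediate results as follows:
C11=a11×b11+a12×b21
C12=a11×b12+a12×b22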
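Example 3-3 can be mimicked end to end with a sketch in which a 2×2 rolling kernel computes every first intermediate result. The input sizes (a 2×4 left matrix and a 4×4 right matrix) are an assumption inferred from the two results C11 and C12; the numeric values and function names are hypothetical.

```python
def rolling_matmul_square(a, b):
    """Multiply two p x p matrices: skew, then multiply and roll left/up p-1 times."""
    p = len(a)
    third = [[a[i][(j + i) % p] for j in range(p)] for i in range(p)]
    fourth = [[b[(i + j) % p][j] for j in range(p)] for i in range(p)]
    fifth = [[0] * p for _ in range(p)]
    for step in range(p):
        if step > 0:
            third = [row[1:] + row[:1] for row in third]   # roll left one step
            fourth = fourth[1:] + fourth[:1]               # roll up one step
        for i in range(p):
            for j in range(p):
                fifth[i][j] += third[i][j] * fourth[i][j]
    return fifth


def matadd(x, y):
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


left = [[1, 2, 3, 4], [5, 6, 7, 8]]                               # assumed 2 x 4
right = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1], [2, 2, 1, 0]]  # assumed 4 x 4

a11, a12 = [r[:2] for r in left], [r[2:] for r in left]
b11, b12 = [r[:2] for r in right[:2]], [r[2:] for r in right[:2]]
b21, b22 = [r[:2] for r in right[2:]], [r[2:] for r in right[2:]]

C11 = matadd(rolling_matmul_square(a11, b11), rolling_matmul_square(a12, b21))
C12 = matadd(rolling_matmul_square(a11, b12), rolling_matmul_square(a12, b22))
product = [r1 + r2 for r1, r2 in zip(C11, C12)]
```

Every call to the kernel works on 2×2 operands only, so it fits the assumed 2×2 processing-element array regardless of the size of the input matrices.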
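The above is the matrix multiplication method according to the embodiments of the present disclosure. Following this process, the product of the input matrices can be computed by partitioning, so the matrix multiplication method of the present disclosure can handle matrix operations of any size.
The present disclosure also provides a processor. Figure 3-1 shows one example of the processor. The processor may include two or more processing elements arranged as a two-dimensional matrix, each processing element including at least one register, and the processor is configured to perform matrix multiplication of a first matrix and a second matrix.
The processor further includes a controller configured to preprocess the first matrix and the second matrix to obtain a third matrix and a fourth matrix, where the elements at corresponding positions of the third matrix and the fourth matrix are stored in registers of the same processing element, and the third matrix and the fourth matrix are both p×p matrices with p = max(m, k, n), where m is the row rank of the first matrix, n is the column rank of the second matrix, k is both the column rank of the first matrix and the row rank of the second matrix, and p is the largest of m, k, and n.
The controller is configured to roll the third matrix and the fourth matrix in the row direction or the column direction and to control the processing elements to multiply the elements in the corresponding registers to obtain element-product matrices.
The controller is configured to process the element-product matrices according to the way the first matrix and the second matrix were preprocessed, to obtain the product of the first matrix and the second matrix.
In one possible implementation, the controller is further configured to control the processing elements to multiply the elements in the corresponding registers to obtain a first element-product matrix.
The controller repeats the following p-1 times: roll the whole third matrix once to the left and the whole fourth matrix once up, or, roll the whole third matrix once to the right and the whole fourth matrix once down; then control the processing elements to multiply the elements in the corresponding registers to obtain a second element-product matrix.
In one possible implementation, the controller is configured to sum the first element-product matrix and the second element-product matrices to obtain a fifth matrix, and to process the fifth matrix according to the way the first matrix and the second matrix were preprocessed, to obtain the product of the first matrix and the second matrix.
In one possible implementation, the preprocessing of the first matrix and the second matrix by the controller includes a first preprocessing and a second preprocessing,
where the first preprocessing means padding the right and/or bottom of the first matrix and the second matrix with 0 to obtain p×p matrices;
and the second preprocessing means rolling the elements of the padded p×p matrices.
In one possible implementation, for the scheme that rolls the whole third matrix to the left and the whole fourth matrix up, the corresponding second preprocessing is: roll row i of the padded first matrix i steps to the left and roll column j of the padded second matrix j steps up, where i and j are natural numbers with 0≤i≤p-1 and 0≤j≤p-1.
In one possible implementation, for the scheme that rolls the whole third matrix to the right and the whole fourth matrix down, the corresponding second preprocessing is: roll row i of the padded first matrix i-1 steps to the left and roll column j of the padded second matrix j-1 steps up.
In one possible implementation, the controller is further configured to determine, based on the arrangement of the processing elements and the row rank and column rank of the input matrices, whether to partition the input matrices, where the input matrices include a left matrix (the left operand) and a right matrix (the right operand).
If the left matrix is to be partitioned, the controller splits the rows of the left matrix based on the arrangement of the processing elements; if the right matrix is to be partitioned, the controller splits the columns of the right matrix based on the arrangement of the processing elements.
If both input matrices are to be partitioned, the controller partitions the left matrix along its column direction and the right matrix along its row direction in the same way, based on the arrangement of the processing elements and the row rank and column rank of the input matrices.
Partitioning the left matrix yields two or more first matrices and partitioning the right matrix yields two or more second matrices; or, partitioning the left matrix yields two or more second matrices and partitioning the right matrix yields two or more first matrices.
In one possible implementation, if the number of columns of the left matrix is not greater than the number of columns of the processing elements and the number of rows of the right matrix is not greater than the number of rows of the processing elements, then: if the number of rows of the left matrix is greater than the number of rows of the processing elements, the controller determines that the left matrix is to be partitioned; and if the number of columns of the right matrix is greater than the number of columns of the processing elements, the controller determines that the right matrix is to be partitioned.
If the number of columns of the left matrix is greater than the number of columns of the processing elements, or the number of rows of the right matrix is greater than the number of rows of the processing elements, the controller partitions both input matrices.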
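The partitioning decision described above can be written as a small predicate. This is only a sketch of the stated rule: the function name `blocking_plan` and its return format are hypothetical, the left matrix is taken to be m×k and the right matrix k×n, and `pe_rows` and `pe_cols` describe the processing-element array.

```python
def blocking_plan(m, k, n, pe_rows, pe_cols):
    """Decide which input matrices to partition for a pe_rows x pe_cols PE array.

    The left matrix is m x k (k columns); the right matrix is k x n (k rows).
    """
    # Shared dimension too large: both matrices are partitioned in the same way.
    if k > pe_cols or k > pe_rows:
        return {"left": True, "right": True}
    # Otherwise each matrix is partitioned only if its free dimension is too large.
    return {"left": m > pe_rows, "right": n > pe_cols}


fits = blocking_plan(2, 2, 3, pe_rows=3, pe_cols=3)        # nothing to partition
tall_left = blocking_plan(5, 2, 2, pe_rows=2, pe_cols=2)   # only the left matrix
wide_right = blocking_plan(2, 2, 5, pe_rows=2, pe_cols=2)  # only the right matrix
both = blocking_plan(2, 4, 4, pe_rows=2, pe_cols=2)        # shared dimension too large
```

The last call corresponds to the shapes assumed for Example 3-3, where a 2×2 array forces both input matrices to be split.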
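In one possible implementation, the controller is further configured to compute the product of the left matrix and the right matrix from the products of the first matrices and the second matrices, according to the rules of matrix multiplication.
For the detailed process by which the processor of this embodiment performs matrix multiplication, refer to the method section above; it is not repeated here.
An embodiment of the present disclosure also provides an artificial intelligence chip that includes the processor described above. An embodiment of the present disclosure also provides a computing apparatus that includes the processor described above.
In one possible implementation, a board card is also disclosed, which includes a storage device, an interface apparatus, a control device, and the above artificial intelligence chip. The artificial intelligence chip is connected to the storage device, the control device, and the interface apparatus respectively; the storage device is used to store data; the interface apparatus is used for data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.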
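The foregoing may be better understood in light of the following clauses:
Clause C1. A matrix multiplication method based on a matrix of processing elements, applied to a processor, the processor including two or more processing elements arranged as a two-dimensional matrix, each processing element including at least one register, the method performing matrix multiplication of a first matrix and a second matrix, the method including:
preprocessing the first matrix and the second matrix to obtain a third matrix and a fourth matrix, where the elements at corresponding positions of the third matrix and the fourth matrix are stored in registers of the same processing element, and the third matrix and the fourth matrix are both p×p matrices with p = max(m, k, n), where m is the row rank of the first matrix, n is the column rank of the second matrix, k is both the column rank of the first matrix and the row rank of the second matrix, and p is the largest of m, k, and n;
rolling the third matrix and the fourth matrix in the row direction or the column direction, and controlling the processing elements to multiply the elements in the corresponding registers to obtain element-product matrices;
processing the element-product matrices according to the way the first matrix and the second matrix were preprocessed, to obtain the product of the first matrix and the second matrix.
Clause C2. The method according to clause C1, where rolling the third matrix and the fourth matrix in the row direction or the column direction and controlling the processing elements to multiply the elements in the corresponding registers to obtain element-product matrices includes:
controlling the processing elements to multiply the elements in the corresponding registers to obtain a first element-product matrix;
repeating the following p-1 times: rolling the whole third matrix once to the left and the whole fourth matrix once up, or, rolling the whole third matrix once to the right and the whole fourth matrix once down; and controlling the processing elements to multiply the elements in the corresponding registers to obtain a second element-product matrix.
Clause C3. The method according to clause C2, where processing the element-product matrices according to the way the first matrix and the second matrix were preprocessed to obtain the product of the first matrix and the second matrix includes:
summing the first element-product matrix and the second element-product matrices to obtain a fifth matrix, and processing the fifth matrix according to the way the first matrix and the second matrix were preprocessed, to obtain the product of the first matrix and the second matrix.
Clause C4. The method according to clause C1, where preprocessing the first matrix and the second matrix to obtain the third matrix and the fourth matrix includes a first preprocessing and a second preprocessing,
where the first preprocessing means padding the right and/or bottom of the first matrix and the second matrix with 0 to obtain p×p matrices;
and the second preprocessing means rolling the elements of the padded p×p matrices.
Clause C5. The method according to clause C4,
where, for the scheme that rolls the whole third matrix to the left and the whole fourth matrix up, the corresponding second preprocessing is: roll row i of the padded first matrix i steps to the left and roll column j of the padded second matrix j steps up, where i and j are natural numbers with 0≤i≤p-1 and 0≤j≤p-1.
Clause C6. The method according to clause C4,
where, for the scheme that rolls the whole third matrix to the right and the whole fourth matrix down, the corresponding second preprocessing is: roll row i of the padded first matrix i-1 steps to the left and roll column j of the padded second matrix j-1 steps up.
Clause C7. The method according to any one of clauses C1-C6, the method further including:
determining, based on the arrangement of the processing elements and the row rank and column rank of the input matrices, whether to partition the input matrices, where the input matrices include a left matrix and a right matrix;
if the left matrix is to be partitioned, splitting the rows of the left matrix based on the arrangement of the processing elements, and if the right matrix is to be partitioned, splitting the columns of the right matrix based on the arrangement of the processing elements;
if both input matrices are to be partitioned, partitioning the left matrix along its column direction and the right matrix along its row direction in the same way, based on the arrangement of the processing elements and the row rank and column rank of the input matrices;
where partitioning the left matrix yields two or more first matrices and partitioning the right matrix yields two or more second matrices, or, partitioning the left matrix yields two or more second matrices and partitioning the right matrix yields two or more first matrices.
Clause C8. The method according to clause C7, where determining, based on the arrangement of the processing elements and the row rank and column rank of the input matrices, whether to partition the input matrices includes:
if the number of columns of the left matrix is not greater than the number of columns of the processing elements and the number of rows of the right matrix is not greater than the number of rows of the processing elements, then determining that the left matrix is to be partitioned if its number of rows is greater than the number of rows of the processing elements, and determining that the right matrix is to be partitioned if its number of columns is greater than the number of columns of the processing elements;
if the number of columns of the left matrix is greater than the number of columns of the processing elements, or the number of rows of the right matrix is greater than the number of rows of the processing elements, partitioning both input matrices.
Clause C9. The method according to clause C7, the method further including: computing the product of the left matrix and the right matrix from the products of the first matrices and the second matrices, according to the rules of matrix multiplication.
Clause C10. A processor, the processor including two or more processing elements arranged as a two-dimensional matrix, each processing element including at least one register, the processor being configured to perform matrix multiplication of a first matrix and a second matrix, the processor further including a controller configured to preprocess the first matrix and the second matrix to obtain a third matrix and a fourth matrix, where the elements at corresponding positions of the third matrix and the fourth matrix are stored in registers of the same processing element, and the third matrix and the fourth matrix are both p×p matrices with p = max(m, k, n), where m is the row rank of the first matrix, n is the column rank of the second matrix, k is both the column rank of the first matrix and the row rank of the second matrix, and p is the largest of m, k, and n;
the controller being configured to roll the third matrix and the fourth matrix in the row direction or the column direction and to control the processing elements to multiply the elements in the corresponding registers to obtain element-product matrices;
the controller being configured to process the element-product matrices according to the way the first matrix and the second matrix were preprocessed, to obtain the product of the first matrix and the second matrix.
Clause C11. The processor according to clause C10, where the controller is further configured to control the processing elements to multiply the elements in the corresponding registers to obtain a first element-product matrix;
and the controller repeats the following p-1 times: rolling the whole third matrix once to the left and the whole fourth matrix once up, or, rolling the whole third matrix once to the right and the whole fourth matrix once down; and controlling the processing elements to multiply the elements in the corresponding registers to obtain a second element-product matrix.
Clause C12. The processor according to clause C11, where the controller is configured to sum the first element-product matrix and the second element-product matrices to obtain a fifth matrix, and to process the fifth matrix according to the way the first matrix and the second matrix were preprocessed, to obtain the product of the first matrix and the second matrix.
Clause C13. The processor according to clause C10, where the preprocessing of the first matrix and the second matrix by the controller includes a first preprocessing and a second preprocessing,
where the first preprocessing means padding the right and/or bottom of the first matrix and the second matrix with 0 to obtain p×p matrices;
and the second preprocessing means rolling the elements of the padded p×p matrices.
Clause C14. The processor according to clause C13, where, for the scheme that rolls the whole third matrix to the left and the whole fourth matrix up, the corresponding second preprocessing is: roll row i of the padded first matrix i steps to the left and roll column j of the padded second matrix j steps up, where i and j are natural numbers with 0≤i≤p-1 and 0≤j≤p-1.
Clause C15. The processor according to clause C13, where, for the scheme that rolls the whole third matrix to the right and the whole fourth matrix down, the corresponding second preprocessing is: roll row i of the padded first matrix i-1 steps to the left and roll column j of the padded second matrix j-1 steps up.
Clause C16. The processor according to any one of clauses C10-C15,
where the controller is further configured to determine, based on the arrangement of the processing elements and the row rank and column rank of the input matrices, whether to partition the input matrices, where the input matrices include a left matrix and a right matrix;
if the left matrix is to be partitioned, the controller splits the rows of the left matrix based on the arrangement of the processing elements, and if the right matrix is to be partitioned, the controller splits the columns of the right matrix based on the arrangement of the processing elements;
if both input matrices are to be partitioned, the controller partitions the left matrix along its column direction and the right matrix along its row direction in the same way, based on the arrangement of the processing elements and the row rank and column rank of the input matrices;
where partitioning the left matrix yields two or more first matrices and partitioning the right matrix yields two or more second matrices, or, partitioning the left matrix yields two or more second matrices and partitioning the right matrix yields two or more first matrices.
Clause C17. The processor according to clause C16, where, if the number of columns of the left matrix is not greater than the number of columns of the processing elements and the number of rows of the right matrix is not greater than the number of rows of the processing elements, the controller determines that the left matrix is to be partitioned if its number of rows is greater than the number of rows of the processing elements, and determines that the right matrix is to be partitioned if its number of columns is greater than the number of columns of the processing elements;
if the number of columns of the left matrix is greater than the number of columns of the processing elements, or the number of rows of the right matrix is greater than the number of rows of the processing elements, the controller partitions both input matrices.
Clause C18. The processor according to clause C16, where the controller is further configured to compute the product of the left matrix and the right matrix from the products of the first matrices and the second matrices, according to the rules of matrix multiplication.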
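Figure 4 shows a structural block diagram of a board card according to an embodiment of the present disclosure. Referring to Figure 4, in addition to the above chip 189, the board card may include other supporting components, including but not limited to a storage device 190, an interface apparatus 191, and a control device 192.
The storage device 190 is connected to the artificial intelligence chip through a bus and is used to store data. The storage device may include multiple groups of storage units 193. Each group of storage units is connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without raising the clock frequency, because it allows data to be read on both the rising edge and the falling edge of the clock pulse; DDR is therefore twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of storage units, and each group of storage units may include multiple DDR4 dies (chips). In one embodiment, the artificial intelligence chip may internally include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits for ECC checking.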
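As a rough back-of-the-envelope illustration of the 72-bit controller described above, the sketch below computes the theoretical data bandwidth of one controller and the share of the bus used for ECC. The transfer rate of 3200 mega-transfers per second (a DDR4-3200 part) is purely an assumption chosen for the arithmetic; the text does not state which DDR4 speed grade is used.

```python
DATA_BITS = 64                 # bits per transfer that carry data (from the text)
ECC_BITS = 8                   # bits per transfer used for ECC checking (from the text)
NUM_CONTROLLERS = 4            # four DDR4 controllers (from the text)

# Assumption for illustration only: DDR4-3200, i.e. 3200 million transfers per second.
TRANSFERS_PER_SECOND = 3200 * 10**6

bytes_per_transfer = DATA_BITS // 8
bandwidth_mb_s = TRANSFERS_PER_SECOND * bytes_per_transfer // 10**6
total_bandwidth_mb_s = bandwidth_mb_s * NUM_CONTROLLERS

# Fraction of the 72-bit bus that carries check bits rather than data.
ecc_overhead = ECC_BITS / (DATA_BITS + ECC_BITS)
```

Under that assumed speed grade, one controller would move 25600 MB/s of data, and one ninth of the bus width is spent on ECC.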
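In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface apparatus is electrically connected to the artificial intelligence chip and is used for data transmission between the artificial intelligence chip and external equipment (such as a server or a computer). For example, in one embodiment the interface apparatus may be a standard PCIE interface: data to be processed is passed from the server to the chip through the standard PCIE interface, realizing data transfer. In another embodiment, the interface apparatus may be another kind of interface; the present disclosure does not limit its specific form, as long as the interface apparatus can perform the transfer function. In addition, the computation results of the artificial intelligence chip are sent back to the external equipment (such as a server) by the interface apparatus.
The control device is electrically connected to the artificial intelligence chip and is used to monitor the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and can drive multiple loads, so it may be in different working states such as heavy load and light load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
An embodiment of the present disclosure also provides a computer-readable storage medium on which computer program instructions are stored, the computer program instructions implementing the above method when executed by a processor. The computer-readable storage medium may be a non-volatile computer-readable storage medium.
An embodiment of the present disclosure also provides an electronic device including the above processor.
It should be understood that the above embodiments are only illustrative, and the apparatus of the present disclosure may also be implemented in other ways. For example, the division into units/modules in the above embodiments is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units, modules, or components may be combined or integrated into another system, or some features may be omitted or not executed.
In addition, unless otherwise specified, the functional units/modules in the embodiments of the present disclosure may be integrated into one unit/module, may each exist physically on their own, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or in the form of software program modules.
If the integrated units/modules are implemented in hardware, the hardware may be digital circuits, analog circuits, and so on. Physical implementations of the hardware structure include but are not limited to transistors, memristors, and so on.
If the integrated units/modules are implemented in the form of software program modules and sold or used as an independent product, they may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments. The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove on which instructions are stored, and any suitable combination of the above. The computer-readable storage medium used here is not to be interpreted as a transient signal itself, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber-optic cables), or electrical signals transmitted through wires.
The computer-readable program instructions described here can be downloaded from a computer-readable storage medium to individual computing/processing devices, or downloaded to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in that computing/processing device.
The computer program instructions used to carry out the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is personalized using the state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement aspects of the present disclosure.
Aspects of the present disclosure are described here with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. The computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a particular way, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps is performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the drawings show the possible architecture, functionality, and operation of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used to explain the principles and implementations of the present disclosure; the description of the above embodiments is only intended to help in understanding the method of the present disclosure and its core idea. At the same time, changes or variations made by those skilled in the art to the specific implementations and scope of application, based on the idea of the present disclosure, fall within the scope of protection of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.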

Claims (16)

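  1. A matrix multiplication method based on a matrix of processing elements, characterized in that the method is applied to a processor, the processor includes two or more processing elements arranged as a two-dimensional matrix, each processing element includes at least one register, and the method performs matrix multiplication of a first matrix and a second matrix,
    the method including:
    loading the first matrix into the registers of the processing elements;
    for each row of the second matrix, storing the elements of that row into the registers of the processing elements in correspondence with the elements of each column of the first matrix, multiplying them respectively with the elements of each column of the first matrix, and computing the sum of a column of products to obtain a first intermediate result; or, for each column of the second matrix, storing the elements of that column into the registers of the processing elements in correspondence with the elements of each row of the first matrix, multiplying them respectively with the elements of each row of the first matrix, and computing the sum of a row of products to obtain a first intermediate result;
    processing the first intermediate results to obtain the product of the first matrix and the second matrix.
  2. The method according to claim 1, characterized in that the first matrix is the left matrix and the second matrix is the right matrix,
    for each column of elements in the second matrix, each element of that column is stored, together with the corresponding column of elements of the first matrix, into the registers of the processing elements, each processing element is controlled to multiply the elements in the corresponding registers to obtain element products, and the sum of each row of element products is computed to obtain the first intermediate result,
    where the column of elements in the first matrix corresponding to an element is the column whose column number equals the row number of that element in the second matrix.
  3. The method according to claim 1, characterized in that the first matrix is the right matrix and the second matrix is the left matrix,
    for each row of elements in the second matrix, each element of that row is stored, together with the corresponding row of elements of the first matrix, into the registers of the processing elements, each processing element is controlled to multiply the elements in the corresponding registers to obtain element products, and the sum of each column of element products is computed to obtain the first intermediate result,
    where the row of elements in the first matrix corresponding to an element is the row whose row number equals the column number of that element in the second matrix.
  4. The method according to any one of claims 1-3, characterized in that the method further includes:
    determining, based on the arrangement of the processing elements, the input matrix that does not need to be partitioned as the first matrix and the other input matrix as the second matrix, where the input matrices include a left matrix and a right matrix.
  5. The method according to any one of claims 1-3, characterized in that the method further includes:
    determining a matrix to be loaded from the input matrices, where the input matrices include a left matrix and a right matrix, and the matrix to be loaded is the left matrix or the right matrix;
    determining, based on the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, whether to partition the matrix to be loaded;
    if the matrix to be loaded is to be partitioned, partitioning it based on the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, to obtain two or more first matrices.
  6. The method according to claim 5, characterized in that the method further includes:
    partitioning the other input matrix, besides the matrix to be loaded, according to the way the matrix to be loaded was partitioned, to obtain two or more second matrices;
    computing the product of the left matrix and the right matrix from the products of the first matrices and the corresponding second matrices, according to the rules of matrix multiplication.
  7. The method according to claim 5, characterized in that the processor includes multiple groups of registers, and the method further includes:
    after the input matrix has been partitioned, storing the two or more first matrices stacked in the multiple groups of registers, each group storing one first matrix.
  8. A processor, characterized in that the processor includes two or more processing elements arranged as a two-dimensional matrix, each processing element includes at least one register, and the processor is configured to perform matrix multiplication of a first matrix and a second matrix,
    the processor further includes a controller configured to load the first matrix into the registers of the processing elements;
    for each row of the second matrix, the controller is configured to store the elements of that row into the registers of the processing elements in correspondence with the elements of each column of the first matrix, multiply them respectively with the elements of each column of the first matrix, and compute the sum of a column of products to obtain a first intermediate result; or, for each column of the second matrix, the controller is configured to store the elements of that column into the registers of the processing elements in correspondence with the elements of each row of the first matrix, multiply them respectively with the elements of each row of the first matrix, and compute the sum of a row of products to obtain a first intermediate result;
    the controller is further configured to process the first intermediate results to obtain the product of the first matrix and the second matrix.
  9. The processor according to claim 8, characterized in that the first matrix is the left matrix and the second matrix is the right matrix,
    for each column of elements in the second matrix, the controller is configured to store each element of that column, together with the corresponding column of elements of the first matrix, into the registers of the processing elements, control each processing element to multiply the elements in the corresponding registers to obtain element products, and compute the sum of each row of element products to obtain the first intermediate result,
    where the column of elements in the first matrix corresponding to an element is the column whose column number equals the row number of that element in the second matrix.
  10. The processor according to claim 8, characterized in that the first matrix is the right matrix and the second matrix is the left matrix,
    for each row of elements in the second matrix, the controller is configured to store each element of that row, together with the corresponding row of elements of the first matrix, into the registers of the processing elements, control each processing element to multiply the elements in the corresponding registers to obtain element products, and compute the sum of each column of element products to obtain the first intermediate result,
    where the row of elements in the first matrix corresponding to an element is the row whose row number equals the column number of that element in the second matrix.
  11. The processor according to any one of claims 8-10, characterized in that the processor is further configured to determine, based on the arrangement of the processing elements, the input matrix that does not need to be partitioned as the first matrix and the other input matrix as the second matrix, where the input matrices include a left matrix and a right matrix.
  12. The processor according to any one of claims 8-10, characterized in that the controller is further configured to determine a matrix to be loaded from the input matrices, where the input matrices include a left matrix and a right matrix, and the matrix to be loaded is the left matrix or the right matrix; and to determine, based on the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, whether to partition the matrix to be loaded;
    if the matrix to be loaded is to be partitioned, the controller is configured to partition it based on the arrangement of the processing elements and the row rank and column rank of the matrix to be loaded, to obtain two or more first matrices.
  13. The processor according to claim 12, characterized in that the controller is further configured to partition the other input matrix, besides the matrix to be loaded, according to the way the matrix to be loaded was partitioned, to obtain two or more second matrices; and to compute the product of the left matrix and the right matrix from the products of the first matrices and the corresponding second matrices, according to the rules of matrix multiplication.
  14. The processor according to claim 12, characterized in that the processor includes multiple groups of registers, and after the input matrix has been partitioned, the controller is further configured to store the two or more first matrices stacked in the multiple groups of registers, each group storing one first matrix.
  15. An artificial intelligence chip, characterized in that the chip includes the processor according to any one of claims 8-14.
  16. An electronic device, characterized in that it includes the artificial intelligence chip according to claim 15.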
PCT/CN2021/075957 2020-04-21 2021-02-08 运算方法、处理器以及相关产品 WO2021212972A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/920,372 US20230169144A1 (en) 2020-04-21 2021-02-08 Operation method, processor, and related product

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN202010318387.0A CN113536221B (zh) 2020-04-21 2020-04-21 运算方法、处理器以及相关产品
CN202010317734.8A CN113536219B (zh) 2020-04-21 2020-04-21 运算方法、处理器以及相关产品
CN202010317734.8 2020-04-21
CN202010318380.9A CN113536220A (zh) 2020-04-21 2020-04-21 运算方法、处理器及相关产品
CN202010318387.0 2020-04-21
CN202010318380.9 2020-04-21

Publications (1)

Publication Number Publication Date
WO2021212972A1 true WO2021212972A1 (zh) 2021-10-28

Family

ID=78270293

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/075957 WO2021212972A1 (zh) 2020-04-21 2021-02-08 运算方法、处理器以及相关产品

Country Status (2)

Country Link
US (1) US20230169144A1 (zh)
WO (1) WO2021212972A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375721A (zh) * 2010-08-23 2012-03-14 联想(北京)有限公司 一种矩阵乘法运算方法、图形处理器和电子设备
CN106445471A (zh) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 处理器和用于在处理器上执行矩阵乘运算的方法
CN109213962A (zh) * 2017-07-07 2019-01-15 华为技术有限公司 运算加速器
CN110415157A (zh) * 2018-04-26 2019-11-05 华为技术有限公司 一种矩阵乘法的计算方法及装置
US20190339942A1 (en) * 2018-05-04 2019-11-07 Eric B. Olsen Residue number matrix multiplier

Also Published As

Publication number Publication date
US20230169144A1 (en) 2023-06-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21793508

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21793508

Country of ref document: EP

Kind code of ref document: A1