CN113536219B

CN113536219B - Operation method, processor and related products

Info

Publication number: CN113536219B
Application number: CN202010317734.8A
Authority: CN
Inventors: Name withheld at the inventor's request
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2020-04-21
Filing date: 2020-04-21
Publication date: 2024-01-26
Anticipated expiration: 2040-04-21
Also published as: CN113536219A

Abstract

The present disclosure relates to an operation method, a processor and related products. The product comprises a storage device, an interface device, a control device and an artificial intelligence chip, wherein the artificial intelligence chip is connected to the storage device, the control device and the interface device respectively; the storage device is used for storing data; the interface device is used for data transmission between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip. With the method or the product, the operation efficiency of the related product in matrix multiplication operations can be improved.

Description

Operation method, processor and related products

Technical Field

The present disclosure relates to the field of information processing technologies, and in particular, to an operation method, a processor, and related products.

Background

In the field of artificial intelligence, neural network algorithms are a class of machine learning algorithms that have recently become very popular and have achieved very good results in various fields, such as image recognition, speech recognition and natural language processing. As neural network algorithms develop, their complexity keeps increasing, and model sizes are gradually growing in order to improve recognition accuracy. Running these large-scale models on GPUs and CPUs consumes a great deal of computation time and power.

Disclosure of Invention

Accordingly, there is a need for an operation method, a processor and related products.

According to a first aspect of the present disclosure, there is provided a processor comprising more than two processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the processor being for performing a matrix multiplication operation on a first matrix and a second matrix,

the processor further comprises a controller, wherein the controller is used for loading elements of a transposed matrix of the first matrix and elements of the second matrix into registers of the processing elements respectively, and the elements of the transposed matrix and the elements of the second matrix corresponding to the positions are stored in the registers of the same processing element;

the controller is used for controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing element to multiply elements in the corresponding register to obtain element products, and summing the element products of the same row or the same column to obtain a first intermediate result;

the controller is further configured to process the first intermediate result to obtain a product of the first matrix and the second matrix.

According to a second aspect of the present disclosure, there is provided a method of operation of matrix multiplication based on a matrix of processing elements, for use in a processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the method implementing a matrix multiplication operation on a first matrix and a second matrix, the method comprising:

the first matrix is transposed to obtain a transposed matrix, each element of the transposed matrix and the second matrix is loaded into a register of each processing element, and elements at corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element;

controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing element to multiply elements in the corresponding register to obtain element products, and summing the element products of the same row or the same column to obtain a first intermediate result;

and processing the first intermediate result to obtain the product of the first matrix and the second matrix.

According to a third aspect of the present disclosure, there is provided an artificial intelligence chip comprising a processor as described above.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising an artificial intelligence chip as described above.

According to the matrix multiplication operation method, the processor and other products of the embodiments of the present disclosure, for any scale of input matrices that meet the arrangement of processing elements, the operation result of matrix multiplication can be obtained, and compared with matrix multiplication operation in related technologies, the memory access times can be reduced, the bandwidth pressure is reduced, and the operation efficiency is improved.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.

Fig. 1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.

Figs. 2a and 2b each show an example of a different way of partitioning.

Fig. 3 shows a flowchart of an operation method according to an embodiment of the present disclosure.

Fig. 4 illustrates a schematic diagram of an array of processing elements according to an embodiment of the present disclosure.

Fig. 5 shows a schematic diagram of a partition according to an embodiment of the present disclosure.

Fig. 6 illustrates an example of partitioning a matrix according to an embodiment of the present disclosure.

Fig. 7 shows a block diagram of a board according to an embodiment of the present disclosure.

Detailed Description

The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person skilled in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.

It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification, and drawings of this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of this disclosure are taken to specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in this disclosure and in the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

As used in this specification and the claims, the term "if" may be interpreted as "when" or "once" or "in response to a determination" or "in response to detection", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determination" or "in response to a determination" or "upon detection of [the described condition or event]" or "in response to detection of [the described condition or event]".

When information is processed using artificial intelligence, matrix operations account for a relatively large share of the computation. When processing a matrix operation, an existing processor decomposes it into multiplication operations and addition operations, so data must be read from memory frequently and the operation efficiency is very low.

In order to solve the above technical problems, the present disclosure provides an operation method and a processor for executing the operation method. The processor may comprise a plurality of processing elements (more than two), which may be arranged in a two-dimensional matrix, each processing element may comprise at least one register.

Fig. 1 shows a schematic diagram of a processor according to an embodiment of the present disclosure. As shown in fig. 1, a plurality of processing elements PE (Processing Element) are arranged in a two-dimensional matrix, and each processing element is connected to an adjacent processing element, and at least one register (not shown) may be provided in each PE. The processor may also include a controller and a memory, wherein the controller and the memory are both coupled to the plurality of processing elements, and the controller may be coupled to the memory. The controller is configured to load input data from the memory into the registers of the processing element and control the processing element to process the input data, for example, the memory may store a first matrix and a second matrix, and the processor is configured to perform a matrix multiplication operation on the first matrix and the second matrix, so that the controller may load the first matrix and the second matrix into the registers of the processing element and control the processing element to perform the matrix multiplication operation.

In one possible implementation, the memory may further store an executable program containing instructions that implement the matrix multiplication operation on the first matrix and the second matrix. The controller may be provided with a loader, a decoder, and the like. The loader may be configured to load input data from the memory into the registers of the processing elements. The decoder may decode the data-access instructions in the executable program according to the storage addresses of the loaded input data; for example, for a data-access instruction, the address of the register in which the data is stored is assigned to the address of the instruction to obtain the decoded instruction. The decoded instruction is sent to the processing element, which executes it to process the data, for example to perform the matrix multiplication operation on the first matrix and the second matrix.

In one possible implementation, the memory may be an on-chip cache, and the controller may load the executable program and the input data (e.g., the input matrices, including a left-hand matrix and a right-hand matrix) from off-chip flash memory into the memory (on-chip cache) and then perform the subsequent matrix multiplication operations.

In one possible implementation, the controller may also load the input matrix and executable program directly from off-chip memory into registers of the processing element, which is not limited by this disclosure.

An arithmetic unit may further be included in each PE to perform specified operations. For example, the PE may include operators for matrix operations, such as a multiplier and an adder. The specific structures of the PEs may be the same or different, which is not limited in this disclosure. Other types of operators may also be included in the PE to accommodate various operation procedures, and the present disclosure does not limit the number or type of operators included in the PE.

The input matrices of the matrix multiplication may include a left-hand matrix, which is the matrix located to the left of the multiplication sign, and a right-hand matrix, which is the matrix located to the right of the multiplication sign.

Since the number and arrangement of PEs in the processor are fixed, before loading data into the registers of the processing elements and computing, the controller may determine whether to partition the input matrices into blocks based on the arrangement of the processing elements and the row rank and column rank of the input matrices. The arrangement of the processing elements refers to the number of rows and columns of the processing elements, and the row rank and column rank of the input matrices refer to the number of rows and columns of the left-hand matrix and the right-hand matrix.

Determining whether to partition the input matrices in this way may mean the following: the controller judges whether the number of rows of an input matrix, or of the transpose of an input matrix, is greater than the number of rows of the processing elements, and whether its number of columns is greater than the number of columns of the processing elements, and decides whether to partition the input matrices according to the result.

If, for one of the input matrices, the number of rows is not greater than the number of rows of the processing elements and the number of columns is not greater than the number of columns of the processing elements, and, for the transpose of the other input matrix, the number of rows is not greater than the number of rows of the processing elements and the number of columns is not greater than the number of columns of the processing elements, then the input matrices need not be partitioned.

If the number of rows of any one of the input matrices is greater than the number of rows of the processing elements, or its number of columns is greater than the number of columns of the processing elements, or the number of rows of the transpose of any one of the input matrices is greater than the number of rows of the processing elements, or its number of columns is greater than the number of columns of the processing elements, the controller may partition the input matrices.

For example, assume that the array of processing elements is denoted PE_MN, meaning that the processing elements form an M×N matrix, where M is the number of rows and N is the number of columns. Assume that one input matrix is A_mn, an m×n matrix with m rows and n columns, and that the other input matrix is B_nk, an n×k matrix with n rows and k columns. If the number of rows m of A_mn is not greater than the number of rows M of the processing elements and its number of columns n is not greater than the number of columns N of the processing elements, and the transposed matrix B^T_kn of B_nk has a number of rows k not greater than M and a number of columns n not greater than N, the input matrices need not be partitioned. Alternatively, if the transposed matrix A^T_nm of A_mn has a number of rows n not greater than M and a number of columns m not greater than N, and B_nk has a number of rows n not greater than M and a number of columns k not greater than N, the input matrices need not be partitioned.

If the number of rows m of A_mn is greater than the number of rows M of the processing elements, or its number of columns n is greater than the number of columns N of the processing elements, or the transposed matrix B^T_kn of B_nk has a number of rows k greater than M or a number of columns n greater than N, the input matrices may be partitioned. Likewise, if the transposed matrix A^T_nm has a number of rows n greater than M or a number of columns m greater than N, or B_nk has a number of rows n greater than M or a number of columns k greater than N, the input matrices may be partitioned.
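The decision described above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name `needs_blocking` and its parameters are hypothetical, and the sketch treats blocking as unnecessary whenever either of the two "fits" conditions from the preceding example holds.

```python
def needs_blocking(pe_rows, pe_cols, m, n, k):
    """Return True if A (m x n) times B (n x k) must be split into blocks
    before it fits a pe_rows x pe_cols array of processing elements.

    Hypothetical helper: the name and signature are not from the disclosure.
    """
    def fits(rows, cols):
        return rows <= pe_rows and cols <= pe_cols

    # Option 1: load A as is, together with the transpose of B (k x n).
    a_with_bt = fits(m, n) and fits(k, n)
    # Option 2: load the transpose of A (n x m), together with B as is.
    at_with_b = fits(n, m) and fits(n, k)
    return not (a_with_bt or at_with_b)


# A 2x2 product on a 2x2 array fits; a 3x2 left-hand matrix does not.
small_fits = not needs_blocking(2, 2, 2, 2, 2)
tall_needs_split = needs_blocking(2, 2, 3, 2, 2)
```

On a 4×4 array, a 2×4 matrix times a 4×3 matrix also fits without blocking, because A (2×4) and the transpose of B (3×4) both stay within four rows and four columns.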

To block one of the input matrices, the controller may split the rows of the left-hand matrix or split the columns of the right-hand matrix according to the arrangement of the processing elements.

For example, assume that the array of processing elements is PE₂₂, the left-hand matrix is A₃₂, and the right-hand matrix is B₂₂. Then A₃₂ can be split into A₁₂ and A₂₂, each of which is multiplied by B₂₂. If the left-hand matrix is A₂₂ and the right-hand matrix is B₃₂, then B₃₂ can be split into B₁₂ and B₂₂.
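As a rough illustration of the first case, the sketch below splits the rows of a 3×2 left-hand matrix so that each block fits a 2×2 array, multiplies each block by the right-hand matrix, and stacks the partial results. The helper names `matmul` and `split_rows` are hypothetical, the numeric values are made up, and the block order (two rows first, then one) is an arbitrary choice.

```python
def matmul(a, b):
    """Plain reference matrix product on nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_rows(a, max_rows):
    """Split a matrix into row blocks of at most max_rows rows each."""
    return [a[r:r + max_rows] for r in range(0, len(a), max_rows)]


A = [[1, 2], [3, 4], [5, 6]]   # left-hand matrix, 3 x 2
B = [[7, 8], [9, 10]]          # right-hand matrix, 2 x 2

blocks = split_rows(A, 2)      # a 2 x 2 block and a 1 x 2 block
# Each block is multiplied by B on its own; the row blocks of the
# result are then stacked in the original order.
stacked = [row for block in blocks for row in matmul(block, B)]
```

Because each row of the product depends on only one row of the left-hand matrix, stacking the per-block products reproduces the full product.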

To partition both of the input matrices, the controller may partition the column direction of the left-hand matrix and the row direction of the right-hand matrix in the same manner, depending on the arrangement of the processing elements and the row rank and column rank of the input matrices.

That is, the left-hand matrix and the transposed right-hand matrix may be partitioned in the same manner in the column direction, or the transposed left-hand matrix and right-hand matrix may be partitioned in the same manner in the row direction, where the same manner of partitioning means that the number of columns or rows of the first matrix and the second matrix obtained after the partitioning are the same, so as to ensure that the matrix operation can be normally completed.

It is assumed that the division of the left-hand matrix may result in more than two first matrices, the division of the right-hand matrix may result in more than two second matrices, or the division of the right-hand matrix may result in more than two first matrices, and the division of the left-hand matrix may result in more than two second matrices.

When the column direction of the left-hand matrix and the row direction of the right-hand matrix are partitioned in the same manner according to the arrangement of the processing elements and the row rank and column rank of the input matrices, the first matrix and the second matrix obtained after partitioning must satisfy the condition that no further partitioning is required. That is, for the first matrix and the transpose of the second matrix, the number of rows is not greater than the number of rows of the processing elements and the number of columns is not greater than the number of columns of the processing elements; or, for the transpose of the first matrix and the second matrix, the number of rows is not greater than the number of rows of the processing elements and the number of columns is not greater than the number of columns of the processing elements.

In one possible implementation, the controller may partition the matrices so that the row rank and column rank of each resulting first matrix or second matrix are as close as possible to the number of rows and columns of the processing elements, which may improve operation efficiency and shorten operation time. For example, if the processing elements form a 4×4 array, the matrices may preferentially be partitioned so that the resulting blocks are 4×4, which makes the most efficient use of the processing elements.

For example, assume a 2×2 array of processing elements, with one input matrix being a 2×4 matrix and the other a 4×3 matrix. The partitioning can be done in many ways. Figs. 2a and 2b each show a different way, in which matrix A₂₄ in the column direction and matrix B₄₃ in the row direction are partitioned in the same manner. Fig. 2a is one example: matrix A₂₄ is divided into two parts in the column direction, each comprising two columns, and matrix B₄₃ is divided into two parts in the row direction, each comprising two rows. Fig. 2b is another example: matrix A₂₄ is divided into three parts in the column direction, one comprising two columns and the other two comprising one column each, and matrix B₄₃ is divided into three parts in the row direction, one comprising two rows and the other two comprising one row each. The above arrangement of processing elements and manner of partitioning the input matrices are merely examples and do not limit the present disclosure in any way.

The present disclosure does not particularly limit how the row direction of the left-hand matrix and the column direction of the right-hand matrix are divided, as long as the divided matrices satisfy the condition that no further partitioning is required.
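The two-part partition of the 2×4 and 4×3 example can be checked numerically. The sketch below is an assumption-laden illustration: the helper names and the matrix values are invented, and only the shared dimension (columns of A, rows of B) is split, so the 2×3 right-hand blocks would still need a further column split to fit a 2×2 array.

```python
def matmul(a, b):
    """Plain reference matrix product on nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_shared(a, b, width):
    """Split the columns of a and the rows of b in the same manner."""
    pairs = []
    for start in range(0, len(b), width):
        a_block = [row[start:start + width] for row in a]
        b_block = b[start:start + width]
        pairs.append((a_block, b_block))
    return pairs


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                 # 2 x 4
B = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0],
     [2, 0, 1]]                    # 4 x 3

pairs = split_shared(A, B, 2)      # two (2 x 2, 2 x 3) block pairs
total = [[0] * len(B[0]) for _ in A]
for a_block, b_block in pairs:
    partial = matmul(a_block, b_block)
    # Partial products over the shared dimension are accumulated.
    total = [[t + p for t, p in zip(t_row, p_row)]
             for t_row, p_row in zip(total, partial)]
```

Partitioning both operands identically along the shared dimension is what keeps each block pair multipliable; the full product is the elementwise sum of the block products.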

According to the operation rule of matrix multiplication, elements in the rows of the left-hand matrix and elements in the columns of the right-hand matrix are multiplied one by one and then summed. Thus, in one possible implementation manner, for the case of no blocking, or the first matrix and the corresponding second matrix after blocking, the controller is configured to load each element of the transpose matrix of the first matrix and each element of the second matrix into a register of each processing element, respectively, where the elements of the transpose matrix and the corresponding position of the second matrix are stored in the registers of the same processing element. According to the matrix multiplication rule, the elements in the positions corresponding to the transposed matrix and the second matrix may refer to elements in the transposed matrix and elements in the second matrix, where multiplication is required.

In one possible implementation, the controller may first transpose the first matrix to obtain the transposed matrix and then load the elements of the transposed matrix into the registers of the processing elements. In another possible implementation, the controller may transpose the first matrix during loading. For example, assuming that the first matrix is the right-hand matrix, the controller may, while loading the first matrix into the registers of the processing elements, load each column of elements of the first matrix into the registers of one row of processing elements.

In one possible implementation, the transposed matrix and the second matrix are aligned in the row direction or the column direction. Specifically, if the left-hand matrix is transposed, then after loading, the rows of the transposed matrix of the first matrix are aligned with the rows of the second matrix, that is, the two are aligned in the column direction; if the right-hand matrix is transposed, then after loading, the columns of the transposed matrix are aligned with the columns of the second matrix, that is, the two are aligned in the row direction.
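Transposing during loading amounts to writing each column of the first matrix into one row of registers. A minimal sketch follows, assuming a hypothetical `load_transposed` helper and zero-filled unused registers; the disclosure does not specify how unused registers are initialised.

```python
def load_transposed(matrix, pe_rows, pe_cols):
    """Load a matrix into a pe_rows x pe_cols register grid, transposing
    on the fly: column c of the matrix lands in row c of the grid.

    Unused registers are set to 0 here, which is an assumption.
    """
    rows, cols = len(matrix), len(matrix[0])
    assert cols <= pe_rows and rows <= pe_cols, "transposed matrix must fit"
    regs = [[0] * pe_cols for _ in range(pe_rows)]
    for c in range(cols):
        for r in range(rows):
            regs[c][r] = matrix[r][c]
    return regs


first = [[1, 2, 3],
         [4, 5, 6]]                      # 2 x 3
registers = load_transposed(first, 3, 3)
```

No separate transposed copy is materialised in this sketch; the index swap happens at the moment each element is written to its register.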

After the transpose matrix and the second matrix are loaded, the controller is further configured to control elements in the transpose matrix or the second matrix to roll in a row direction or a column direction, control the processing element to multiply elements in the corresponding register to obtain element products, and sum the element products in the same row or the same column to obtain a first intermediate result. Specifically, the controller controls the processing element, the transpose matrix stored in the register, and the second matrix to repeat the following process until the elements in the transpose matrix or the second matrix are restored to the positions when not scrolled: the controller controls the processing element to multiply the elements in the corresponding register to obtain element products, sums the element products of the same row or column to obtain a first intermediate result, and controls the transposed matrix or the second matrix stored in the register to scroll by one row or one column in the row direction or the column direction.

That is, the processing elements are first controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row or column are summed to obtain a first intermediate result. The elements of the transposed matrix or the second matrix are then rolled by one row or one column in the row direction or the column direction. After the roll, it is judged whether the elements of the transposed matrix or the second matrix are back at their initial positions, where the initial positions are the positions of the elements before any rolling. If they are, the process ends. If they are not, the multiplication, summation, rolling and judgment are repeated, and the process cycles until the elements of the transposed matrix or the second matrix are back at their initial positions after a roll.
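The loop described above can be sketched for one concrete configuration. The assumptions here are: the first matrix is the left-hand matrix, its transpose is rolled one column to the left per step, both register grids are n×m (so the product is m×m), and a step counter stands in for the "back at the initial positions" check, since the elements return to their starting positions after exactly m single-column rolls. The function name is hypothetical.

```python
def first_intermediate(at, b):
    """Collect one first intermediate result per roll step.

    at: transpose of the left-hand matrix, n x m (nested lists)
    b:  right-hand matrix, n x m
    Returns m rows; row t holds the column sums computed after t rolls.
    """
    width = len(at[0])
    current = [row[:] for row in at]
    results = []
    for _ in range(width):          # after `width` rolls the grid is restored
        # Each processing element multiplies its two register values.
        products = [[x * y for x, y in zip(row_a, row_b)]
                    for row_a, row_b in zip(current, b)]
        # Element products in the same column are summed.
        results.append([sum(column) for column in zip(*products)])
        # Roll the transposed matrix one column to the left (closed loop).
        current = [row[1:] + row[:1] for row in current]
    return results


A_T = [[1, 3],
       [2, 4]]                      # transpose of A = [[1, 2], [3, 4]]
B = [[5, 6],
     [7, 8]]
inter = first_intermediate(A_T, B)
```

With these values the product A·B is [[19, 22], [43, 50]]. The first step yields its diagonal (19 and 50), and the second step yields the remaining entries in a skewed order, which is why a further processing step is needed to obtain the product.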

In one example, the first matrix is a left-hand matrix and the second matrix is a right-hand matrix. In another example, the first matrix is a right-hand matrix and the second matrix is a left-hand matrix.

When the first matrix is the left-hand matrix and the second matrix is the right-hand matrix, the controller controls the elements of the transposed matrix to roll in the row direction or controls the elements of the second matrix to roll in the row direction; the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same column are summed to obtain a first intermediate result.

When the first matrix is the right-hand matrix and the second matrix is the left-hand matrix, the controller controls the elements of the transposed matrix to roll in the column direction or controls the elements of the second matrix to roll in the column direction; the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row are summed to obtain a first intermediate result.

In one possible implementation, the scrolling described above moves one row or one column at a time. The processing elements that store the elements of a matrix form a closed loop. Since adjacent processing elements are connected to each other, the controller can determine how the loop is formed according to the dimensions of the matrix. For example, when scrolling by rows (in the column direction), the first row and the last row of the processing elements storing elements of the matrix are connected; during scrolling, if the matrix is scrolled upwards by one row, the first row of elements of the matrix moves from its original storage position to the position where the last row of elements was stored. When scrolling by columns (in the row direction), the first column and the last column of the processing elements storing elements of the matrix are connected; during scrolling, if the matrix is scrolled to the left by one column, the first column of elements of the matrix moves from its original storage position to the position where the last column of elements was stored. The connection between these processing elements may be a virtual connection; that is, there is no physical connection, but the controller records which processing elements correspond to each other, so that a closed loop is formed during scrolling.

Once the elements of the transposed matrix or the second matrix have returned to their unscrolled positions, the process of scrolling and computing first intermediate results is complete, and the controller may process the first intermediate results to obtain the product of the first matrix and the second matrix.
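The closed-loop behaviour is a circular shift. The sketch below models the register grid as nested lists; the function names are hypothetical, and the wrap-around is done in software, in the spirit of the virtual connection described above.

```python
def roll_rows_up(grid, steps=1):
    """Scroll by rows (column direction): every row moves up by `steps`,
    and rows leaving the top re-enter at the bottom."""
    s = steps % len(grid)
    return grid[s:] + grid[:s]


def roll_cols_left(grid, steps=1):
    """Scroll by columns (row direction): every column moves left by
    `steps`, and columns leaving on the left re-enter on the right."""
    s = steps % len(grid[0])
    return [row[s:] + row[:s] for row in grid]


grid = [[1, 2, 3],
        [4, 5, 6]]
up_once = roll_rows_up(grid)        # first row moves to the last row's place
left_once = roll_cols_left(grid)    # first column moves to the last column's place
```

Scrolling as many times as there are rows (or columns) returns every element to its starting position, which is the termination condition used by the multiply-and-scroll loop.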

In one possible implementation, the controller stores the first intermediate result in rows or columns, and scrolls in the row direction or column direction to obtain the product of the first matrix and the second matrix. The specific processing mode is related to the transposed matrix and the scrolling direction, for example:

when the first matrix is the right-hand matrix and the second matrix is the left-hand matrix, and the transposed matrix is scrolled upwards in the column direction, the first intermediate results can be stored in columns and the elements of the first intermediate results scrolled to the right in the row direction; for example, the elements of the i-th row are scrolled i-1 steps to the right in the row direction;

when the first matrix is the right-hand matrix and the second matrix is the left-hand matrix, and the transposed matrix is scrolled downwards in the column direction, the first intermediate results can be stored in columns and the elements of the first intermediate results scrolled to the left in the row direction; for example, the elements of the i-th row are scrolled i-1 steps to the left in the row direction;

when the first matrix is the left-hand matrix and the second matrix is the right-hand matrix, and the transposed matrix is scrolled to the left in the row direction, the first intermediate results can be stored in rows and the elements of the i-th column of the first intermediate results scrolled downwards by i-1 steps in the column direction to obtain the product of the input matrices;

when the first matrix is the left-hand matrix and the second matrix is the right-hand matrix, and the transposed matrix is scrolled to the right in the row direction, the first intermediate results can be stored in rows and the elements of the i-th column of the first intermediate results scrolled i-1 steps in the column direction to obtain the product of the input matrices.
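One of the cases above can be traced end to end: the first matrix is the left-hand matrix, its transpose is scrolled to the left, the first intermediate results are stored in rows, and the i-th column is then scrolled downwards by i-1 steps. The sketch assumes an m×n left-hand matrix and an n×m right-hand matrix (so the product is square); the function names are hypothetical, and the other three cases are not modelled.

```python
def reference_matmul(a, b):
    """Plain matrix product, used only to check the result."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matmul_by_scrolling(a, b):
    """Multiply a (m x n) by b (n x m) with the scroll-and-sum scheme."""
    m, n = len(a), len(a[0])
    assert len(b) == n and len(b[0]) == m, "sketch assumes a square product"
    at = [[a[r][c] for r in range(m)] for c in range(n)]   # transpose, n x m
    inter = []                                             # stored in rows
    for _ in range(m):
        # Multiply matching registers and sum each column.
        inter.append([sum(at[i][j] * b[i][j] for i in range(n))
                      for j in range(m)])
        # Scroll the transposed matrix one column to the left.
        at = [row[1:] + row[:1] for row in at]
    # Column j (0-based) is scrolled downwards by j steps, i.e. the
    # i-th column (1-based) by i-1 steps.
    return [[inter[(r - j) % m][j] for j in range(m)] for r in range(m)]


A = [[1, 2], [3, 4], [5, 6]]        # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]          # 2 x 3
product = matmul_by_scrolling(A, B)
```

After t scrolls, the sum in column j equals the product entry at row (j + t) mod m of that column, so scrolling column j down by j steps puts every entry in its proper row.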

In the related art, for matrix multiplication with a relatively large input matrix size, in order to improve the efficiency of matrix operation, a multi-stage pipeline mode is generally adopted to implement the operation process, but since each stage of the multi-stage pipeline processes a part of input data, the data needs to be frequently read from a memory, and the requirement on bandwidth is relatively high due to frequent access to the memory. In order to solve the technical problem, the processor provided by the disclosure can stack and store the input matrix after blocking, and simultaneously perform matrix multiplication operation on the corresponding matrix after blocking, so that the memory access frequency can be reduced, and the operation efficiency is improved.

If the first matrix is obtained by blocking according to a left-hand matrix or the second matrix is obtained by blocking according to a right-hand matrix, in a possible implementation, the controller is further configured to calculate a product of the left-hand matrix and the right-hand matrix based on a product of the first matrix and the second matrix. That is, the products of the first matrix and the second matrix are calculated for the first matrix and the corresponding second matrix after the blocking, respectively, and then the products of the left-hand matrix and the right-hand matrix are calculated from the products of the first matrix and the second matrix. Therefore, the access frequency can be reduced, and the operation efficiency is improved.

In another possible implementation, the processor includes multiple sets of registers. That is, the controller may divide the registers of the processing elements into a plurality of groups according to the case of blocking the matrix.

In this way, the controller may transpose more than two of the first matrices to obtain transposed matrices after partitioning the input matrices; the controller loads the transposed matrix and more than two second matrices into the plurality of groups of registers to be stacked and stored, wherein the transposed matrix and the second matrix at corresponding positions are stored in one group of registers.

Before each time the elements in the transposed matrix or the second matrix are scrolled once in the row direction or the column direction, the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result. After controlling the elements of the transposed matrix in a group of registers to scroll by one row or one column in the row direction or the column direction, the controller also corrects the scrolling result.

In one possible implementation, correcting the scrolling result includes:

if the data scroll to the left in the row direction, the correction is to move the last column of each block's transposed matrix after scrolling into the last column of the adjacent previous block's transposed matrix;

if the data scroll to the right in the row direction, the correction is to move the first column of each block's transposed matrix after scrolling into the first column of the adjacent next block's transposed matrix;

if the data scroll upward in the column direction, the correction is to move the last row of each block's transposed matrix after scrolling into the last row of the adjacent previous block's transposed matrix;

if the data scroll downward in the column direction, the correction is to move the first row of each block's transposed matrix after scrolling into the first row of the adjacent next block's transposed matrix.

Here, each block's transposed matrix refers to the matrix obtained by transposing one block matrix after blocking. The specific calculation and correction procedures are described in detail in the examples below.

The present disclosure also provides an operation method for implementing matrix multiplication operations.

For the case of no blocking, or the first matrix and the second matrix after blocking, fig. 3 shows a flowchart of an operation method according to an embodiment of the present disclosure. For the case of no blocking, the left-hand matrix may be directly used as the first matrix and the right-hand matrix may be directly used as the second matrix, or the left-hand matrix may be directly used as the second matrix and the right-hand matrix may be directly used as the first matrix, which is not limited in the present disclosure.

As shown in fig. 3, the operation method provided in the present disclosure may include the following steps:

Step S11: the first matrix is transposed to obtain a transposed matrix, and the transposed matrix and the second matrix are loaded into the registers of the processing elements, where the elements at corresponding positions of the transposed matrix and the second matrix are stored in the registers of the same processing element.

According to the rule of matrix multiplication, the elements at corresponding positions of the transposed matrix and the second matrix are an element of the transposed matrix and an element of the second matrix that need to be multiplied together.

Step S12: the transposed matrix or the second matrix is controlled to scroll in the row direction or the column direction, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row or the same column are summed to obtain a first intermediate result.

In one possible implementation, step S12 may specifically include repeating the following process until the elements in the transposed matrix or the second matrix return to their unscrolled positions: the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row or the same column are summed to obtain a first intermediate result; the transposed matrix or the second matrix is then scrolled by one row or one column, in the row direction or the column direction, within the array of processing elements.

Step S13: the first intermediate result is processed to obtain the product of the first matrix and the second matrix.

That is, in step S12 the processing elements are first controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row or the same column are summed to obtain a first intermediate result. The elements in the transposed matrix or the second matrix are then scrolled by one row or one column in the row direction or the column direction, and after the scrolling it is determined whether the elements are back at their initial positions, where the initial positions are the positions of the elements in the transposed matrix or the second matrix before any scrolling. If they are, the loop ends and step S13 is performed. If they are not, the multiplication, summation, scrolling and determination are repeated, and this cycle continues until the elements in the transposed matrix or the second matrix are back at their initial positions after a scroll.
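The loop described above can be sketched in Python. This is only an illustration under stated assumptions, not the disclosed hardware: the register array is modelled with nested lists, the case shown is a transposed right-hand matrix scrolling upward by one row per step, and the names `first_intermediate` and `order` are hypothetical. The termination test compares positions (which row of the transposed matrix sits in which row of processing elements), not values.

```python
def first_intermediate(second, transposed):
    """Step S12 sketch: multiply, sum per row, scroll up, and repeat until
    the transposed matrix is back at its unscrolled position.

    Returns one list of row sums per iteration.
    """
    n = len(transposed)
    initial = list(range(n))
    # order[i] = index of the transposed-matrix row currently held by PE row i
    order = initial[:]
    results = []
    while True:
        # Each processing element multiplies the two values it holds;
        # the products within one row are then summed.
        results.append([
            sum(x * y for x, y in zip(second[i], transposed[order[i]]))
            for i in range(n)
        ])
        order = order[1:] + order[:1]  # scroll up by one row (first row wraps to last)
        if order == initial:           # elements are back at their initial positions
            return results


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]    # second (left-hand) matrix
b_t = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # transposed right-hand matrix
columns = first_intermediate(a, b_t)
```

For an n×n transposed matrix the loop body runs n times, so `columns` holds n lists, each of which would be stored as one column of the first intermediate result.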

When the first matrix is a left-hand matrix and the second matrix is a right-hand matrix, in step S12 the elements in the transposed matrix or the elements in the second matrix are controlled to scroll in the row direction, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same column are summed to obtain a first intermediate result.

When the first matrix is a right-hand matrix and the second matrix is a left-hand matrix, in step S12 the elements in the transposed matrix or the elements in the second matrix are controlled to scroll in the column direction, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row are summed to obtain a first intermediate result.

In one possible implementation, the scrolling described above proceeds one row or one column at a time.

For step S13, processing the first intermediate result may refer to storing the first intermediate result in rows or in columns and scrolling it in the row direction or the column direction to obtain the product of the first matrix and the second matrix. The specific processing depends on which matrix is transposed and on the scrolling direction, for example:

when the first matrix is a right-hand matrix and the second matrix is a left-hand matrix, and the transposed matrix is scrolled upward in the column direction, the first intermediate result may be stored in columns and its elements scrolled to the right in the row direction; for example, the i-th row is scrolled i-1 steps to the right in the row direction.

The process of steps S11 to S13 is described below for two cases: the first matrix being a right-hand matrix and the second matrix a left-hand matrix, and the first matrix being a left-hand matrix and the second matrix a right-hand matrix.

Example 1: the first matrix is a right-hand matrix and the second matrix is a left-hand matrix, that is, the right-hand matrix is transposed.

Assume that the first matrix bₙₖ and the second matrix aₘₙ are both 3 × 3 matrices and that the processing elements form a 4 × 4 array.

FIG. 4 illustrates a schematic diagram of an array of processing elements according to an embodiment of the present disclosure. The operation method of the present disclosure will be described with reference to fig. 4 and 3.

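Assume the first matrix is

$$\begin{pmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{pmatrix}$$

and the second matrix is

$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}.$$

Then the transposed matrix obtained by transposing the first matrix is

$$\begin{pmatrix} B_{11} & B_{21} & B_{31} \\ B_{12} & B_{22} & B_{32} \\ B_{13} & B_{23} & B_{33} \end{pmatrix}.$$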

The second matrix may be loaded into the registers of the processing elements according to its row and column arrangement, i.e. the elements of the second matrix are arranged in the registers of the processing elements in the same way as they are arranged in the second matrix.

In one possible implementation, the row and column numbers of an element in the second matrix are the same as the row and column numbers, in the array, of the processing element into which that element is loaded.

For example, in one example, A₁₁ may be loaded into the register of PE₁₁, A₁₂ into the register of PE₁₂, A₁₃ into the register of PE₁₃, A₂₁ into the register of PE₂₁, …, and A₃₃ into the register of PE₃₃. That is, the index of an element in the second matrix may be identical to the index of the processing element in which it is located.

In another example, A₁₁ may be loaded into the register of PE₁₂, A₁₂ into the register of PE₁₃, A₁₃ into the register of PE₁₄, A₂₁ into the register of PE₂₂, …, and A₃₃ into the register of PE₃₄. That is, the elements of the second matrix are arranged in the registers of the processing elements in the same way as they are arranged in the matrix.

It should be noted that the above are only some examples of loading the second matrix and do not limit the present disclosure in any way; those skilled in the art will understand that any loading is acceptable as long as the arrangement of the elements in the registers of the processing elements is the same as their arrangement in the matrix.

The transposed matrix may be loaded into the registers of the processing elements in the same manner as the second matrix, so that after loading the columns of the second matrix are aligned with the columns of the transposed matrix and the elements at corresponding positions of the transposed matrix and the second matrix are stored in the registers of the same processing element.

For example, suppose A₁₁ is loaded into the register of PE₁₁, A₁₂ into the register of PE₁₂, A₁₃ into the register of PE₁₃, A₂₁ into the register of PE₂₁, …, and A₃₃ into the register of PE₃₃, that is, the index of an element in the second matrix is identical to the index of the processing element in which it is located. Then B₁₁ may be loaded into the register of PE₁₁, B₂₁ into the register of PE₁₂, B₃₁ into the register of PE₁₃, B₁₂ into the register of PE₂₁, B₂₂ into the register of PE₂₂, B₃₂ into the register of PE₂₃, …, and B₃₃ into the register of PE₃₃. That is, the transposed matrix is loaded into the registers of the processing elements in an order that is column-aligned with the second matrix.

In one possible implementation manner, the transpose matrix may be loaded first and then the second matrix may be loaded, or the loading may be performed simultaneously, so long as it is ensured that the transpose matrix and the second matrix are aligned in the row direction after loading, and elements corresponding to the transpose matrix and the second matrix are stored in registers of the same processing element.

In one possible implementation, after the input matrices are loaded, for the case where the right-hand matrix is transposed, the processing elements storing the first row of the transposed matrix and the processing elements storing its last row may be connected in the column direction to form rings, and data may flow within each ring to scroll the matrix in the column direction. As shown in fig. 1, PE₁₁ may be connected with PE₃₁ to form a ring, PE₁₂ with PE₃₂ to form a ring, and PE₁₃ with PE₃₃ to form a ring. Thus, when data flows upward in the rings, the data of the first row flows to the third row, the data of the second row flows to the first row, and the data of the third row flows to the second row; when data flows downward, the data of the first row flows to the second row, the data of the second row flows to the third row, and the data of the third row flows to the first row.
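The two flow directions in the ring can be illustrated with list rotation. This is a minimal sketch in which each row of processing-element registers is one list entry; the function names `flow_up` and `flow_down` are hypothetical and the string labels simply stand for the stored elements.

```python
def flow_up(rows):
    """Upward flow in the ring: row 1 -> last row, row 2 -> row 1, and so on."""
    return rows[1:] + rows[:1]


def flow_down(rows):
    """Downward flow in the ring: row 1 -> row 2, ..., last row -> row 1."""
    return rows[-1:] + rows[:-1]


# Transposed right-hand matrix as loaded into the 3 x 3 part of the array.
b_t = [["B11", "B21", "B31"],
       ["B12", "B22", "B32"],
       ["B13", "B23", "B33"]]

up = flow_up(b_t)      # first row now sits in the third row
down = flow_down(b_t)  # third row now sits in the first row
```

Because the rows form a closed ring, one upward step followed by one downward step leaves the matrix where it started.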

In this embodiment, only the transposed matrix may be scrolled. Before the transposed matrix is scrolled for the first time, the controller may control the processing elements to multiply the elements in the corresponding registers to obtain element products, and sum the element products in the same row to obtain the first intermediate result. Taking the above example, the controller may control PE₁₁ to multiply the elements A₁₁ and B₁₁ stored in its registers to obtain the element product A₁₁×B₁₁; likewise, the controller may control PE₁₂ and PE₁₃ to obtain A₁₂×B₂₁ and A₁₃×B₃₁.

The controller may then sum the element products in the same row to obtain C₁₁ = A₁₁×B₁₁ + A₁₂×B₂₁ + A₁₃×B₃₁.

C₂₂ and C₃₃ can be obtained in the same manner.

In one possible implementation, C₁₁, C₂₂ and C₃₃ may be temporarily stored in a buffer as the first column of the first intermediate result. The buffer may be located in the processor at a position outside the plurality of processing elements.

Next, in one possible implementation, the transpose matrix may be scrolled up one row, with the elements of the first row scrolled to the last row (of the processing elements storing the elements of the matrix). Alternatively, the transpose matrix may be scrolled down by one line, and the specific scroll direction is not limited in the present disclosure, and the scrolling may be performed in units of lines in the column direction for the example in the present embodiment.

As shown in fig. 1, when scrolling up, the data of the first row scrolls to the third row, so that after one upward scroll the transposed matrix held in the registers is:

$$\begin{pmatrix} B_{12} & B_{22} & B_{32} \\ B_{13} & B_{23} & B_{33} \\ B_{11} & B_{21} & B_{31} \end{pmatrix}$$

In one possible implementation, the scrolling of data in the matrix may be implemented using redundant registers within the processing elements or an on-chip cache in the processor. This implementation is applicable to the scrolling process in Examples 1 and 2 of the present disclosure.

For example, taking the example 1 as an example, the first row of elements of the transpose matrix may be temporarily stored in the redundant registers, the processing element of the second row is controlled to send the second row of elements of the transpose matrix stored in the corresponding registers to the processing element of the first row, then the processing element of the third row is controlled to send the third row of elements of the transpose matrix stored in the corresponding registers to the processing element of the second row, and finally the temporarily stored first row of elements may be stored in the corresponding registers of the processing element of the third row, thereby implementing the scrolling process of one row of data of the transpose matrix. The above process is merely one example of the present disclosure and is not intended to limit the present disclosure in any way.
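The redundant-register scroll just described can be sketched as follows. This is an assumption-laden illustration: the rows of registers are modelled as a Python list, the variable `spare` plays the role of the redundant registers, and the function name `scroll_up_with_spare` is hypothetical.

```python
def scroll_up_with_spare(registers):
    """Scroll the rows up by one, in place, using one spare row buffer."""
    spare = registers[0]                 # park the first row in the redundant registers
    for r in range(1, len(registers)):
        registers[r - 1] = registers[r]  # row r sends its data to row r-1
    registers[-1] = spare                # the parked first row lands in the last row
    return registers


regs = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
scroll_up_with_spare(regs)
```

Only one row of extra storage is needed, because each row is overwritten only after its previous contents have already been moved on.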

The processing elements are again controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row are summed to obtain a first intermediate result: the first row of a₃₃ multiplied by the second row of the transposed matrix gives C₁₂, the second row of a₃₃ multiplied by the third row of the transposed matrix gives C₂₃, and the third row of a₃₃ multiplied by the first row of the transposed matrix gives C₃₁. C₁₂, C₂₃ and C₃₁ are temporarily stored in the buffer as the second column of the first intermediate result.

The transposed matrix is scrolled up by one row again, the elements in the corresponding registers are multiplied to obtain element products, and the element products of the same row are summed to obtain the first intermediate results C₁₃, C₂₁ and C₃₂, which are temporarily stored in the buffer as the third column of the first intermediate result.

That is, the first intermediate result stored in the buffer is

$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{22} & C_{23} & C_{21} \\ C_{33} & C_{31} & C_{32} \end{pmatrix}$$

For step S13, in the case where the transposed matrix is scrolled upward, processing the first intermediate result means that the controller stores the obtained first intermediate result in columns and then scrolls the i-th row of the first intermediate result to the right in the row direction by i-1 steps to obtain the product of the input matrices. Scrolling here again means scrolling in a closed loop in the row direction: the first column and the last column of the processing elements storing the elements of the matrix are connected to form a closed loop, so that when scrolling to the right, the elements stored in the last column of processing elements scroll into the first column of processing elements.
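Why scrolling the i-th row right by i-1 steps restores the product can be checked with a short index calculation. The following is a sketch under the assumptions of Example 1 (n × n matrices, upward scrolling by one row per step, rows and columns numbered from 1); the symbols M, r and j(i, r) are introduced here only for illustration.

$$
\begin{aligned}
j(i,r) &= \big((i-1+r) \bmod n\big) + 1, \qquad r = 0,\dots,n-1,\\
M_{i,\,r+1} &= \sum_{t=1}^{n} A_{it}\,B_{t,\,j(i,r)} = C_{i,\,j(i,r)},\\
\big((r+1) - 1 + (i-1)\big) \bmod n + 1 &= j(i,r).
\end{aligned}
$$

After r upward scrolls, the i-th row of processing elements holds row j(i, r) of the transposed matrix, so the sum stored in column r+1 of row i of the first intermediate result M is C with indices i and j(i, r). Scrolling that row right by i-1 steps moves the entry from column r+1 to column j(i, r), which is exactly its column in the product. For n = 3, i = 2 and r = 1 this gives j = 3: C₂₃ is stored in column 2 of row 2 and moves to column 3.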

Optionally, for step S13, for the case of scrolling the transpose matrix downward, the processing the first intermediate result means that the controller stores the obtained first intermediate result in columns, and then the controller scrolls the i-th row element in the first intermediate result in the row direction to the left by i-1 steps to obtain the product of the input matrix.

It will be appreciated by those skilled in the art that for step S13, the product of the input matrix may also be obtained by the controller by scrolling the elements in the first intermediate result in the row direction (e.g., scrolling right or scrolling left) based on the row-column identification of the first intermediate result. In this embodiment, the elements stored in the register may all carry row and column identifiers of the elements in the matrix, and during the scrolling, the row and column identifiers of the elements in the first intermediate result are determined according to the row and column identifiers of the elements in the matrix, so that the controller may scroll the elements in the first intermediate result in the row direction according to the row and column identifiers of the first intermediate result to obtain the product of the first matrix and the second matrix.
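One way to picture the row-column identification is to tag every stored element with its position and let the tags decide where each sum belongs. The sketch below is a simplification and an assumption, not the disclosed controller logic: instead of scrolling the first intermediate result, it writes each sum directly to the position named by its tags, which has the same effect. The function name `product_by_labels` and the tuple layout `(value, row, column)` are hypothetical.

```python
def product_by_labels(a, b):
    """Multiply a (left-hand) by b (right-hand), placing each row sum by its tags."""
    n = len(a)
    # Tag every element with its (row, column) position in its own matrix.
    a_tagged = [[(a[i][j], i, j) for j in range(n)] for i in range(n)]
    # Transposed right-hand matrix: row c holds column c of b; tags keep b's indices.
    bt_tagged = [[(b[r][c], r, c) for r in range(n)] for c in range(n)]
    c_out = [[None] * n for _ in range(n)]
    for _ in range(n):
        for a_row, bt_row in zip(a_tagged, bt_tagged):
            total = sum(x[0] * y[0] for x, y in zip(a_row, bt_row))
            row_id = a_row[0][1]   # row tag carried by the left-hand elements
            col_id = bt_row[0][2]  # column tag carried by the right-hand elements
            c_out[row_id][col_id] = total
        bt_tagged = bt_tagged[1:] + bt_tagged[:1]  # scroll the transposed matrix up
    return c_out


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 3], [0, 1, 0], [2, 0, 1]]
product = product_by_labels(a, b)
```

With tags available, the placement no longer depends on remembering how many scrolls have taken place or in which direction.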

Taking the above example, row 1 scrolls 0 steps to the right, i.e. it does not scroll. Row 2 scrolls 1 step to the right, that is, C₂₁ scrolls 1 step to the right into column 1, C₂₃ into column 3, and C₂₂ into column 2, which yields:

$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{33} & C_{31} & C_{32} \end{pmatrix}$$

Scrolling row 3 to the right by 2 steps then gives the product of the input matrices:

$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{pmatrix}$$

in a possible implementation, in step S12, the second matrix may also be scrolled in the column direction, and the specific process is similar to that of the transpose matrix scroll, except for a slight difference in the manner of processing and scrolling the elements in step S13. The present disclosure will not be repeated for specific deriving processes, and refer to the above processes.

It should be noted that the arrangement of the processing elements, the input matrix, etc. in the above examples are merely for clarity of illustration of the process of the disclosed operation method, and do not limit the present disclosure in any way.
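The whole of Example 1 can be checked numerically with a short Python sketch. It assumes square n × n inputs that fit the array, models the registers with nested lists, and uses the hypothetical name `matmul_roll_up`; it follows the upward-scrolling variant only.

```python
def matmul_roll_up(a, b):
    """Multiply a (left-hand) by b (right-hand) following the Example 1 scheme."""
    n = len(a)
    # Step S11: transpose the right-hand matrix.
    b_t = [[b[r][c] for r in range(n)] for c in range(n)]
    inter = [[0] * n for _ in range(n)]  # first intermediate result
    for step in range(n):
        for i in range(n):
            # PE row i holds row i of a and the row of b_t currently in position i.
            inter[i][step] = sum(a[i][j] * b_t[i][j] for j in range(n))
        b_t = b_t[1:] + b_t[:1]  # Step S12: scroll the transposed matrix up one row
    # Step S13: scroll row i right by i steps (0-based), i.e. i-1 steps when
    # rows are numbered from 1.
    return [row[n - i:] + row[:n - i] for i, row in enumerate(inter)]


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 3], [0, 1, 0], [2, 0, 1]]
product = matmul_roll_up(a, b)
```

Each pass of the outer loop fills one column of `inter`, matching the column-wise storage of the first intermediate result in the example.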

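Example 2: the first matrix is a left-hand matrix and the second matrix is a right-hand matrix, that is, the left-hand matrix is transposed.

Still assume that the first matrix aₘₙ and the second matrix bₙₖ are both 3 × 3 matrices and that the processing elements form a 4 × 4 array.

Assume the first matrix is

$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix};$$

then the transposed matrix of the first matrix is

$$\begin{pmatrix} A_{11} & A_{21} & A_{31} \\ A_{12} & A_{22} & A_{32} \\ A_{13} & A_{23} & A_{33} \end{pmatrix},$$

and the second matrix is

$$\begin{pmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{pmatrix}.$$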

The second matrix is loaded into the registers of the processing elements in the loading manner described in Example 1, which is not repeated here. The transposed matrix is then loaded into the registers of the processing elements in the same manner as the second matrix, so that after loading the rows of the transposed matrix of the first matrix are aligned with the rows of the second matrix.

For example, suppose B₁₁ is loaded into the register of PE₁₁, B₁₂ into the register of PE₁₂, B₁₃ into the register of PE₁₃, B₂₁ into the register of PE₂₁, …, and B₃₃ into the register of PE₃₃, that is, the index of an element in the second matrix is identical to the index of the processing element in which it is located. Then A₁₁ may be loaded into the register of PE₁₁, A₂₁ into the register of PE₁₂, A₃₁ into the register of PE₁₃, A₁₂ into the register of PE₂₁, A₂₂ into the register of PE₂₂, A₃₂ into the register of PE₂₃, …, and A₃₃ into the register of PE₃₃. That is, the transposed matrix is loaded into the registers of the processing elements in an order that is row-aligned with the other matrix (the second matrix).

In one possible implementation, after the input matrices are loaded, for the case where the first matrix is transposed, the processing elements storing the first column of the transposed matrix and the processing elements storing its last column may be connected in the row direction to form rings, and data may flow within each ring, which makes it possible to scroll in units of columns along the row direction. As shown in fig. 4, PE₁₁ and PE₁₃ may be connected to form a ring, PE₂₁ and PE₂₃ may be connected to form a ring, and PE₃₁ and PE₃₃ may be connected to form a ring. When data flows to the left in the rings, the data of the first column flows to the third column, the data of the second column flows to the first column, and the data of the third column flows to the second column; when data flows to the right, the data of the first column flows to the second column, the data of the second column flows to the third column, and the data of the third column flows to the first column.

In this embodiment, only the transposed matrix may be scrolled. Before the transposed matrix is scrolled to the left or right, in units of columns, for the first time, the controller may control the processing elements to multiply the elements in the corresponding registers to obtain element products, and sum the element products in the same column to obtain the first intermediate result. Taking the above example, PE₁₁ multiplies the elements A₁₁ and B₁₁ stored in its registers to obtain the element product A₁₁×B₁₁; likewise, A₁₂×B₂₁ and A₁₃×B₃₁ can be obtained.

Summing the element products of the first column gives C₁₁ = A₁₁×B₁₁ + A₁₂×B₂₁ + A₁₃×B₃₁.

The sum of the element products of the second column, C₂₂, and the sum of the element products of the third column, C₃₃, can be obtained in the same way.

In one possible implementation, C₁₁, C₂₂ and C₃₃ may be temporarily stored in a buffer as the first row of the first intermediate result.

The transpose matrix may then be scrolled one column to the left, with the elements of the first column scrolled to the last column, or alternatively, one column scrolled to the right, as this disclosure is not limited.

As shown in fig. 1, when scrolling to the left, the data of the first column scrolls to the third column, so that after one leftward scroll the transposed matrix held in the registers is:

$$\begin{pmatrix} A_{21} & A_{31} & A_{11} \\ A_{22} & A_{32} & A_{12} \\ A_{23} & A_{33} & A_{13} \end{pmatrix}$$

The processing elements are again controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same column are summed to obtain a first intermediate result: the second column of the transposed matrix multiplied by the first column of b₃₃ gives C₂₁, the third column of the transposed matrix multiplied by the second column of b₃₃ gives C₃₂, and the first column of the transposed matrix multiplied by the third column of b₃₃ gives C₁₃. C₂₁, C₃₂ and C₁₃ are temporarily stored in the buffer as the second row of the first intermediate result.

The transposed matrix is scrolled one column to the left again, the elements in the corresponding registers are multiplied to obtain element products, and the element products of the same column are summed to obtain the first intermediate results C₃₁, C₁₂ and C₂₃, which are temporarily stored in the buffer as the third row of the first intermediate result.

That is, the first intermediate result stored in the buffer is

$$\begin{pmatrix} C_{11} & C_{22} & C_{33} \\ C_{21} & C_{32} & C_{13} \\ C_{31} & C_{12} & C_{23} \end{pmatrix}$$

In step S13, in the case where the transposed matrix is scrolled to the left, the controller may store the first intermediate result in rows and scroll the i-th column of the first intermediate result downward in the column direction by i-1 steps to obtain the product of the input matrices.

Alternatively, in the case where the transposed matrix is scrolled to the right, the controller may store the first intermediate result in rows and scroll the i-th column of the first intermediate result in the column direction by i-1 steps to obtain the product of the input matrices. The specific steps are similar to those for scrolling to the left and are not described in detail here.

It will be appreciated by those skilled in the art that for step S13, the product of the input matrix may also be obtained by the controller by scrolling the elements in the first intermediate result in the column direction (e.g., up or down) based on the column and row identification of the first intermediate result. In this embodiment, the elements stored in the register may all carry row and column identifiers of the elements in the matrix, and during the scrolling, the row and column identifiers of the elements in the first intermediate result are determined according to the row and column identifiers of the elements in the matrix, so that the controller may scroll the elements in the first intermediate result in the column direction according to the row and column identifiers of the first intermediate result to obtain the product of the input matrix.

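Taking the above example, column 1 scrolls down 0 steps, i.e. it does not scroll. Column 2 scrolls down 1 step, that is, C₁₂ scrolls down 1 step into row 1, C₃₂ into row 3, and C₂₂ into row 2, which gives:

$$\begin{pmatrix} C_{11} & C_{12} & C_{33} \\ C_{21} & C_{22} & C_{13} \\ C_{31} & C_{32} & C_{23} \end{pmatrix}$$

Scrolling column 3 down by 2 steps then gives the product of the input matrices:

$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{pmatrix}$$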

In a possible implementation, in step S12, the second matrix may also be scrolled in the row direction, the specific process being similar to that of the transpose matrix scroll, except for a slight difference in the manner in which the elements are processed and scrolled in step S13. The present disclosure will not be repeated for specific deriving processes, and refer to the above processes.
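Example 2, in its leftward-scrolling form, can be checked numerically in the same spirit. The sketch assumes square n × n inputs, models the registers with nested lists, and uses the hypothetical name `matmul_roll_left`.

```python
def matmul_roll_left(a, b):
    """Multiply a (left-hand) by b (right-hand) following the Example 2 scheme."""
    n = len(a)
    # Step S11: transpose the left-hand matrix.
    a_t = [[a[r][c] for r in range(n)] for c in range(n)]
    inter = []  # first intermediate result, stored row by row
    for _ in range(n):
        # PE column j holds column j of b and the column of a_t currently in
        # position j; the products are summed down each column.
        inter.append([sum(a_t[i][j] * b[i][j] for i in range(n)) for j in range(n)])
        a_t = [row[1:] + row[:1] for row in a_t]  # Step S12: scroll left one column
    # Step S13: scroll column j down by j steps (0-based), i.e. i-1 steps for
    # the i-th column when columns are numbered from 1.
    return [[inter[(i - j) % n][j] for j in range(n)] for i in range(n)]


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 3], [0, 1, 0], [2, 0, 1]]
product = matmul_roll_left(a, b)
```

Compared with Example 1, the roles of rows and columns are exchanged: sums run down columns, results are stored by rows, and the final correction scrolls columns instead of rows.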

The matrix multiplication operation method according to the above embodiments of the present disclosure is more suitable for a processor composed of processing elements arranged in an array. For the input matrix of any scale meeting the arrangement of the processing elements, the operation result of matrix multiplication can be obtained, and compared with the matrix multiplication operation in the related technology, the access times can be reduced, the bandwidth pressure is reduced, and the operation efficiency is improved.

For the case where no blocking is performed, the result of the matrix multiplication can be obtained directly according to the above examples. For the case where blocking is required, the result of multiplying each blocked first matrix by the corresponding second matrix according to the rule of matrix multiplication is taken as a second intermediate result. That is, the second intermediate results are obtained by treating each first matrix and second matrix obtained by blocking as one element of a matrix and performing the matrix multiplication on them, and the product of the input matrices is then calculated from the second intermediate results.

Fig. 5 shows a schematic diagram of blocking according to an embodiment of the present disclosure. As shown in fig. 5, the controller may block the matrices D and E in the manner described above to obtain first matrices D₁₁, D₁₂, D₂₁, D₂₂ and second matrices E₁₁, E₁₂, E₂₁, E₂₂. The controller may perform the matrix multiplication treating each first matrix and second matrix as one element of a matrix: the first row of matrix D multiplied by the first column of matrix E is F₁₁ = D₁₁×E₁₁ + D₁₂×E₂₁, the first row of matrix D multiplied by the second column of matrix E is F₁₂ = D₁₁×E₁₂ + D₁₂×E₂₂, the second row of matrix D multiplied by the first column of matrix E is F₂₁ = D₂₁×E₁₁ + D₂₂×E₂₁, and the second row of matrix D multiplied by the second column of matrix E is F₂₂ = D₂₁×E₁₂ + D₂₂×E₂₂. That is, to obtain the final result of the matrix multiplication, the following second intermediate results must be obtained first:

D₁₁×E₁₁, D₁₂×E₂₁, D₁₁×E₁₂, D₁₂×E₂₂,

D₂₁×E₁₁, D₂₂×E₂₁, D₂₁×E₁₂, D₂₂×E₂₂.

The second intermediate results may be obtained by performing the processes of steps S11 to S13 on each corresponding pair of first matrix and second matrix.

The second intermediate result is obtained by partitioning the input matrix and performing matrix multiplication operation of the present disclosure on the partitioned matrix respectively, and the product of the input matrix can be obtained by calculating according to the second intermediate result. According to the operation method of the embodiment of the disclosure, the matrix multiplication process can be rapidly realized for the matrix with any dimension.
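The blocked computation can be sketched in Python as follows. The sketch makes several assumptions: the inputs are square with a side length divisible by the block size, a plain `matmul` helper stands in for steps S11 to S13 on one pair of blocks, and all function names are hypothetical.

```python
def matmul(x, y):
    """Plain matrix product; stands in for steps S11 to S13 on one block pair."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(m, size):
    """Cut a square matrix into a grid of size x size blocks."""
    n = len(m)
    return [[[row[c:c + size] for row in m[r:r + size]]
             for c in range(0, n, size)]
            for r in range(0, n, size)]


def blocked_matmul(d, e, size):
    """Compute d x e block by block, e.g. F11 = D11 x E11 + D12 x E21."""
    db, eb = split(d, size), split(e, size)
    g = len(db)
    rows = []
    for i in range(g):
        block_row = []
        for j in range(g):
            acc = matmul(db[i][0], eb[0][j])  # a second intermediate result
            for k in range(1, g):
                acc = add(acc, matmul(db[i][k], eb[k][j]))
            block_row.append(acc)
        # Stitch the blocks of one block row back into full-width rows.
        for r in range(size):
            rows.append([v for blk in block_row for v in blk[r]])
    return rows


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
e = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
f = blocked_matmul(d, e, 2)
```

Each call to `matmul` inside `blocked_matmul` corresponds to one of the eight second intermediate results listed for the 2 × 2 block grid.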

In an alternative embodiment, the first matrix and the second matrix after being partitioned may be sequentially stored in the processing element for calculation, or may be stacked and stored in the processing element.

Example 3 stacked storage in combination with step S11-step S13

For example, the method of operation of the present disclosure will be described with an array of 2×2 processing elements and a 4×4 input matrix.

Assume a 4×4 left-hand matrix and a 4×4 right-hand matrix. The controller may divide both the left-hand matrix and the right-hand matrix into 2×2 sub-matrices.

FIG. 6 illustrates an example of matrix partitioning according to one embodiment of the present disclosure. As shown in fig. 6, the controller may divide the left-hand matrix and the right-hand matrix into 2×2 sub-matrices. Dividing the left-hand matrix yields four matrices a₁₁, a₁₂, a₂₁, a₂₂, and dividing the right-hand matrix yields four matrices b₁₁, b₁₂, b₂₁, b₂₂, where the first subscript denotes the block row and the second subscript denotes the block column.
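As a rough illustration of this partitioning, the sketch below splits a matrix into blocks whose size matches the processing element array. The function name `partition`, the 1-based block keys and the sample values are assumptions made for this example; the actual matrices of fig. 6 are not reproduced here.

```python
def partition(matrix, block_rows, block_cols):
    """Split `matrix` (a list of rows) into blocks of block_rows x block_cols.

    Returns a dict keyed by 1-based (block_row, block_col), mirroring the
    a11, a12, a21, a22 naming. Assumes the dimensions divide evenly.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows % block_rows or n_cols % block_cols:
        raise ValueError("matrix dimensions must be multiples of the block size")
    blocks = {}
    for bi in range(n_rows // block_rows):
        rows = matrix[bi * block_rows:(bi + 1) * block_rows]
        for bj in range(n_cols // block_cols):
            blocks[(bi + 1, bj + 1)] = [
                row[bj * block_cols:(bj + 1) * block_cols] for row in rows
            ]
    return blocks


# Placeholder 4x4 left-hand matrix, split for a 2x2 processing element array.
left = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
a = partition(left, 2, 2)
```

Here `a[(1, 2)]` plays the role of a₁₂, the upper-right 2×2 block of the placeholder matrix.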

In the blocked case, if the processing elements contain enough registers to store the input matrices, the input matrices may also be stored in the registers of the processing elements in a stacked manner to implement the multiplication. When the input matrices are stored in a stacked manner, the controller may divide the registers of the processing elements into several groups, each group storing one partitioned first matrix and the corresponding second matrix. The present disclosure does not limit the specific grouping manner, but the registers of one group may each be located in a different processing element.

In the example of stacked storage, one possible calculation manner is to scroll the matrices in units of the first matrices and second matrices obtained by partitioning, and to compute each second intermediate result by the process of steps S11 to S13.

Taking the calculation of the second intermediate results by steps S11-S13 as an example, assume that the processing elements form a 2×2 array and use the partition shown in fig. 6. In the operation method of the present disclosure, the first matrices may be obtained by partitioning either the left-hand matrix or the right-hand matrix.

In the present disclosure, the operation method is illustrated with the first matrices obtained by partitioning the right-hand matrix: the second matrices are loaded, and the corresponding first matrices are loaded after transposition, with the loading results shown in table 1 and table 2. Reg0, Reg1, Reg2 and Reg3 each denote a group of registers in the processing elements. The processing elements form a 2×2 array and each processing element includes a plurality of registers, which the controller can divide into groups. In this embodiment, the registers are divided into 4 groups, and the registers of one group store a transposed matrix and the corresponding second matrix. As shown in table 1 and table 2, Reg0 stores a₁₁ and b₁₁, Reg1 stores a₁₂ and b₂₁, Reg2 stores a₂₁ and b₁₂, and Reg3 stores a₂₂ and b₂₂. That is, the blocks in the first row of the left-hand matrix are paired with the blocks in the first column of the right-hand matrix, and the blocks in the second row are paired with the blocks in the second column.

Table 1 element storage example

Table 2 element storage example
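The grouping described above can be pictured with a small data-structure sketch. It is only an illustration under stated assumptions: the dictionary layout, the names `PAIRING`, `load_groups` and `registers_of_pe`, and all block values are hypothetical and are not taken from tables 1 and 2.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical 2x2 blocks; the values are placeholders, not taken from the disclosure.
a = {(1, 1): [[1, 2], [5, 6]], (1, 2): [[3, 4], [7, 8]],
     (2, 1): [[9, 10], [13, 14]], (2, 2): [[11, 12], [15, 16]]}
b = {(1, 1): [[1, 0], [0, 1]], (1, 2): [[2, 0], [0, 2]],
     (2, 1): [[3, 1], [0, 3]], (2, 2): [[1, 1], [1, 1]]}

# Which block of a and which block of b share a register group.
PAIRING = {"Reg0": ((1, 1), (1, 1)), "Reg1": ((1, 2), (2, 1)),
           "Reg2": ((2, 1), (1, 2)), "Reg3": ((2, 2), (2, 2))}


def load_groups(a_blocks, b_blocks):
    """Build the stacked layout: one second matrix and one transposed first matrix per group."""
    return {name: {"second": a_blocks[ka], "transposed": transpose(b_blocks[kb])}
            for name, (ka, kb) in PAIRING.items()}


def registers_of_pe(groups, row, col):
    """Values held by processing element (row, col): one pair per register group."""
    return {name: (g["second"][row][col], g["transposed"][row][col])
            for name, g in groups.items()}


groups = load_groups(a, b)
pe01 = registers_of_pe(groups, 0, 1)
```

In this layout every processing element contributes one register to each group, so `pe01` lists the four (second matrix, transposed matrix) element pairs held by the processing element in row 0, column 1.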

During the calculation, the processing elements may calculate the second intermediate results a₁₁×b₁₁, a₁₂×b₂₁, a₂₁×b₁₂ and a₂₂×b₂₂ from the elements in each group of registers according to the process of steps S11-S13. The specific process is not described in detail. From the second intermediate results a₁₁×b₁₁, a₁₂×b₂₁, a₂₁×b₁₂ and a₂₂×b₂₂, C₁₁ = a₁₁×b₁₁ + a₁₂×b₂₁ and C₂₂ = a₂₁×b₁₂ + a₂₂×b₂₂ can be calculated.

After the second intermediate results are calculated, the transposed matrices may be scrolled in units of groups. Specifically, the transposed matrices are scrolled up by one row of groups: the transposed-matrix elements in Reg2 are scrolled into Reg0, those in Reg0 are scrolled into Reg2, those in Reg3 are scrolled into Reg1, and those in Reg1 are scrolled into Reg3, whereby table 3 can be obtained.

Table 3 element storage example

In combination with tables 1 and 3, during the calculation the processing elements may calculate the second intermediate results a₁₁×b₁₂, a₁₂×b₂₂, a₂₁×b₁₁ and a₂₂×b₂₁ from the elements in each group of registers according to the process of steps S11-S13. The specific process is not described in detail. From the second intermediate results a₁₁×b₁₂, a₁₂×b₂₂, a₂₁×b₁₁ and a₂₂×b₂₁, C₁₂ = a₁₁×b₁₂ + a₁₂×b₂₂ and C₂₁ = a₂₁×b₁₁ + a₂₂×b₂₁ can be calculated.

According to the above process, the product of the input matrices can be calculated in a block manner.
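The two passes of Example 3 can be traced with the following sketch. It assumes the Reg0 to Reg3 pairing given in the text; `multiply_group` is a hypothetical stand-in for steps S11 to S13, and the block values used with it are placeholders.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def mat_add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def multiply_group(second, transposed_first):
    """Stand-in for steps S11-S13: the product second x first.

    Entry (i, j) is the dot product of row i of `second` with row j of the
    transposed first matrix, i.e. with column j of the first matrix.
    """
    return [[sum(p * q for p, q in zip(row, t_row)) for t_row in transposed_first]
            for row in second]


def stacked_block_product(a, b):
    """a, b: dicts of blocks keyed by 1-based (block_row, block_col)."""
    second = {"Reg0": a[(1, 1)], "Reg1": a[(1, 2)],
              "Reg2": a[(2, 1)], "Reg3": a[(2, 2)]}
    transposed = {"Reg0": transpose(b[(1, 1)]), "Reg1": transpose(b[(2, 1)]),
                  "Reg2": transpose(b[(1, 2)]), "Reg3": transpose(b[(2, 2)])}
    c = {}
    # First pass: a11 x b11, a12 x b21, a21 x b12, a22 x b22.
    p = {g: multiply_group(second[g], transposed[g]) for g in second}
    c[(1, 1)] = mat_add(p["Reg0"], p["Reg1"])
    c[(2, 2)] = mat_add(p["Reg2"], p["Reg3"])
    # Scroll the transposed matrices in units of groups: Reg0 <-> Reg2, Reg1 <-> Reg3.
    transposed = {"Reg0": transposed["Reg2"], "Reg2": transposed["Reg0"],
                  "Reg1": transposed["Reg3"], "Reg3": transposed["Reg1"]}
    # Second pass: a11 x b12, a12 x b22, a21 x b11, a22 x b21.
    p = {g: multiply_group(second[g], transposed[g]) for g in second}
    c[(1, 2)] = mat_add(p["Reg0"], p["Reg1"])
    c[(2, 1)] = mat_add(p["Reg2"], p["Reg3"])
    return c


# Placeholder blocks (not taken from the disclosure).
a = {(1, 1): [[1, 2], [5, 6]], (1, 2): [[3, 4], [7, 8]],
     (2, 1): [[9, 10], [13, 14]], (2, 2): [[11, 12], [15, 16]]}
b = {(1, 1): [[1, 0], [0, 1]], (1, 2): [[2, 0], [0, 2]],
     (2, 1): [[3, 1], [0, 3]], (2, 2): [[1, 1], [1, 1]]}
C = stacked_block_product(a, b)
```

Only the transposed matrices move between groups; the second matrices stay in place, which is why a single group-level scroll is enough to reach the remaining two result blocks in the 2×2 case.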

Thus, the matrix multiplication operation method according to the present disclosure can realize matrix operations of arbitrary size.

Example 4 stacked storage in combination with Whole scrolling

In another possible implementation, a different scrolling manner may be adopted. In the scrolling manner of this embodiment, step S12 in fig. 3 may be implemented as follows: each time before the transposed matrix is scrolled once in the row direction or the column direction, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row (or of the same column, in the example where the left-hand matrix is taken as the first matrix and transposed) are summed to obtain the first intermediate results C₁₁, C₂₂, C₃₃, C₄₄.

Because the input matrices are stored in blocks and stacked, the data of one original row or column is stored in different groups of registers: a row or column that was stored contiguously becomes at least two independent rows or columns held in different groups. The first element of a row or column segment in one group and the last element of the preceding segment in another group were stored contiguously before stacking but are no longer contiguous after stacking. Consequently, after the elements in a group of registers are scrolled once in the row or column direction, the scrolling result needs to be corrected to obtain the correct result. The specific correction manner may be as follows:

each block transposed matrix is scrolled once in the row direction or the column direction;

if scrolling left in the row direction, the correction is to move the last column of data in each block after scrolling to the last column of the adjacent previous block;

if scrolling right in the row direction, the correction is to move the first column of data in each block after scrolling to the first column of the adjacent next block;

if scrolling up in the column direction, the correction is to move the last row of data in each block after scrolling to the last row of the adjacent previous block;

if scrolling down in the column direction, the correction is to move the first row of data in each block after scrolling to the first row of the adjacent next block.

Here, each block refers to each block transposed matrix, that is, the matrix obtained by transposing each block matrix after partitioning.
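The upward-scroll correction can be checked with a short sketch. The assumptions are that the matrix is a Python list of rows, that the blocks are stacked vertically so that `blocks[k - 1]` is the adjacent previous block of `blocks[k]` (with wrap-around from the first block to the last), and that the function names are hypothetical. The other three directions would follow by symmetry.

```python
def roll_up(rows):
    """Cyclically scroll a list of rows up by one: the first row becomes the last."""
    return rows[1:] + rows[:1]


def roll_up_stacked(blocks):
    """Scroll vertically stacked blocks up by one row, then apply the correction.

    blocks[k] holds consecutive rows of the original matrix.
    """
    rolled = [roll_up(block) for block in blocks]   # scroll inside each group
    last_rows = [block[-1] for block in rolled]     # rows that ended up misplaced
    n = len(rolled)
    # Correction: the last row of block k + 1 moves to the last row of block k,
    # i.e. each block hands its last row to the adjacent previous block.
    return [block[:-1] + [last_rows[(k + 1) % n]] for k, block in enumerate(rolled)]


# Placeholder 6-row matrix stored as three stacked 2-row blocks.
rows = [[i, i + 10] for i in range(6)]
blocks = [rows[0:2], rows[2:4], rows[4:6]]
corrected = roll_up_stacked(blocks)
flattened = [row for block in corrected for row in block]
```

After the correction, `flattened` equals `roll_up(rows)`, which is the result a single scroll of the unpartitioned matrix would have produced.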

In this embodiment, the right-hand matrix is transposed and scrolling is performed in the row direction. Because of the stacked storage, however, at least two rows of elements that should be consecutive are treated as separate rows, so scrolling in the row direction within each group of registers alone does not produce the correct result, and correction is required.

Taking table 2 as an example, one row is scrolled up within each group of registers. The result is shown in table 4, in which the first-row elements of each group have scrolled to the last row of the same group. According to table 2, however, the first-row elements of Reg0 and Reg1 should scroll to the last row of Reg2 and Reg3, but they are now in the last row of Reg0 and Reg1 (as shown in table 4); likewise, the first-row elements of Reg2 and Reg3 should scroll to the last row of Reg0 and Reg1, but they are now in the last row of Reg2 and Reg3 (as shown in table 4). In other words, in table 4 the last-row elements of Reg0 and Reg1 should be located in the last row of Reg2 and Reg3, and the last-row elements of Reg2 and Reg3 should be located in the last row of Reg0 and Reg1. The scrolling can therefore be completed by exchanging the last-row elements of Reg2 and Reg0 and exchanging the last-row elements of Reg3 and Reg1, as shown in table 5.

Table 4 element store example

Table 5 element store example

According to tables 1 and 5, the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products and to sum the element products of the same row, giving the first intermediate results C₁₂, C₂₃, C₃₄, C₄₁.

Repeating this process, with 4 calculations and 3 scrolls in total, completes the matrix multiplication, and the product of the input matrices is obtained from the first intermediate results.
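The sequence of 4 calculations and 3 scrolls can be sketched for an unpartitioned square matrix as follows. This is an illustrative model only: it assumes the right-hand matrix is the transposed first matrix, that it is scrolled up by one row per step, and that each first intermediate result is written directly to its final position instead of being stored and scrolled afterwards. The function name and the sample values are hypothetical.

```python
def matmul_by_scrolling(a, b):
    """Multiply square matrices a x b using row sums of element products.

    At step s, row i of the scrolled transposed matrix holds column (i + s) of b,
    so the row sum of element products is entry (i, i + s) of the product.
    """
    n = len(a)
    bt = [list(col) for col in zip(*b)]      # transposed first matrix
    c = [[0] * n for _ in range(n)]
    for step in range(n):
        for i in range(n):
            # Element products along row i of the array, summed along the row.
            c[i][(i + step) % n] = sum(a[i][k] * bt[i][k] for k in range(n))
        if step < n - 1:
            bt = bt[1:] + bt[:1]             # scroll the transposed matrix up one row
    return c


# Placeholder 4x4 inputs (not taken from the disclosure).
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
C = matmul_by_scrolling(A, B)
```

For a 4×4 input, step 0 fills C₁₁, C₂₂, C₃₃, C₄₄ and step 1 fills C₁₂, C₂₃, C₃₄, C₄₁, matching the order of first intermediate results given in the text.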

In an alternative embodiment, the stacked storage may follow the block-wise manner described above. It is not limited to each register storing one matrix element, nor to matrices whose numbers of rows and columns are integer multiples of those of the processing element array, nor to a single stacked storage method. The correction process is the same in each case: it only needs to ensure that the elements of an original row or column are again connected in sequence after correction. The specific stacked storage process is not limited herein.

It should be noted that the above manner of stacking storage and scrolling elements is only one example of the disclosure, and may be implemented in other manners, which are not limited in this disclosure.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.

It should be further noted that, although the steps in the flowchart are sequentially shown as indicated by arrows, the steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps or other steps.

The present disclosure also provides an arithmetic device based on matrix multiplication of a matrix of processing elements, which may be applied to a processor. Fig. 1 shows an example of a processor, which may comprise more than two processing elements arranged in a two-dimensional matrix, each processing element comprising at least one register, said arithmetic means being arranged to implement a matrix multiplication operation on a first matrix and a second matrix.

It should be understood that the above-described apparatus embodiments are merely illustrative and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is merely a logic function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted or not performed.

In addition, each functional unit/module in the embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together, unless otherwise specified. The integrated units/modules described above may be implemented either in hardware or in software program modules.

If implemented in hardware, the integrated units/modules may be digital circuits, analog circuits, and the like. Physical implementations of the hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise indicated, the registers may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), etc.

The integrated units/modules may be stored in a computer readable memory if implemented in the form of software program modules and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present disclosure, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method described in the various embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.

The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a non-volatile computer readable storage medium.

The disclosed embodiments also provide an artificial intelligence chip including a processor as described above.

In one possible implementation, a board is also disclosed, which includes a memory device, an interface device, a control device, and the artificial intelligence chip described above. The artificial intelligence chip is connected to the memory device, the control device and the interface device respectively; the memory device is used for storing data; the interface device is used for data transmission between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip.

Fig. 7 shows a block diagram of a board according to an embodiment of the present disclosure, and referring to fig. 7, the board may further include other mating components in addition to the chip 389, including but not limited to: a memory device 390, an interface device 391 and a control device 392;

the memory device 390 is connected to the artificial intelligence chip through a bus and is used for storing data. The memory device may include multiple groups of memory cells 393, each group being connected to the artificial intelligence chip through a bus. It is understood that each group of memory cells may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).

DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of the memory cells, and each group may include a plurality of DDR4 chips. In one embodiment, the artificial intelligence chip may include four 72-bit DDR4 controllers, in which 64 bits of each 72-bit DDR4 controller are used for data transfer and 8 bits are used for ECC checking.

In one embodiment, each group of memory cells includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transfer and data storage of each memory cell.
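The effect of two transfers per clock cycle on peak data bandwidth can be worked through with a small calculation. The function name and the 1.6 GHz clock used in the example are assumptions for illustration (the disclosure does not state a clock frequency or speed grade); the 64 data bits per controller and the four controllers come from the embodiment above.

```python
def ddr_peak_bandwidth(clock_hz, data_bits=64, controllers=1):
    """Theoretical peak data bandwidth in bytes per second.

    DDR transfers data twice per clock cycle; only the data bits count,
    so the 8 ECC bits of a 72-bit controller are excluded.
    """
    transfers_per_second = 2 * clock_hz
    return transfers_per_second * data_bits // 8 * controllers


# Assumed 1.6 GHz memory clock; not a figure from the disclosure.
one_controller = ddr_peak_bandwidth(1_600_000_000)
four_controllers = ddr_peak_bandwidth(1_600_000_000, controllers=4)
```

Under this assumed clock, one controller moves at most 25.6 GB/s of data and four controllers together at most 102.4 GB/s; real throughput would be lower.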

The interface device is electrically connected with the artificial intelligence chip and is used for data transmission between the artificial intelligence chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface, and the data to be processed is transferred from the server to the chip through the standard PCIe interface. In another embodiment, the interface device may be another interface; the disclosure does not limit its specific form, as long as the interface unit can implement the transfer function. In addition, the computation results of the artificial intelligence chip are transmitted back to the external device (e.g., a server) by the interface device.

The control device is electrically connected with the artificial intelligence chip and is used for monitoring the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device may be electrically connected through an SPI interface. The control device may comprise a microcontroller unit (MCU). The artificial intelligence chip may comprise a plurality of processing chips, a plurality of processing cores or a plurality of processing circuits, and may drive a plurality of loads. The artificial intelligence chip can therefore be in different working states such as multi-load and light-load. The control device can regulate the working states of the plurality of processing chips, processing cores and/or processing circuits in the artificial intelligence chip.

The embodiment of the disclosure also provides an electronic device, which comprises: a processor; as described above, the processor includes two or more processing elements, each including at least one register, arranged in a two-dimensional matrix, and a controller that controls the processing elements;

an electronic device further comprising a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments. The technical features of the foregoing embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, all such combinations should be considered within the scope of the disclosure.

The foregoing may be better understood in light of the following clauses:

Clause A1. A processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the processor being for performing a matrix multiplication operation on a first matrix and a second matrix,

Clause A2. The processor of clause A1,

the controller controls the processing elements and the transposed matrix and second matrix stored in the registers to repeat the following until the elements in the transposed matrix or the second matrix return to their unscrolled positions:

the controller is used for controlling the processing element to multiply the elements in the corresponding register to obtain element products, summing the element products of the same row or the same column to obtain a first intermediate result, and controlling the transposed matrix or the second matrix stored in the register to roll one row or one column in the row direction or the column direction.

Clause A3. The processor of clause A1 or A2,

when the first matrix is a left-hand matrix and the second matrix is a right-hand matrix, the controller controls elements in the transposed matrix to scroll in the row direction or controls elements in the second matrix to scroll in the row direction; the controller controls the processing element to multiply the elements in the corresponding register to obtain element products, and sums the element products in the same column to obtain a first intermediate result;

Clause A4. The processor of clause A1 or A2,

the controller stores the first intermediate result in rows or columns, and the product of the first matrix and the second matrix is obtained after the first intermediate result is scrolled in the row direction or the column direction.

Clause A5. The processor of any of clauses A1-A4, the controller further configured to determine whether to partition an input matrix based on an arrangement of the processing elements and a row rank and a column rank of the input matrix, wherein the input matrix comprises a left-hand matrix and a right-hand matrix;

if one of the input matrices is to be partitioned, the controller splits the rows of the left-hand matrix or splits the columns of the right-hand matrix according to the arrangement of the processing elements;

if both of the input matrices are to be partitioned, the controller partitions the left-hand matrix in the column direction and the right-hand matrix in the row direction in the same manner, according to the arrangement of the processing elements and the row rank and the column rank of the input matrices;

the left-hand matrix is partitioned to obtain two or more first matrices and the right-hand matrix is partitioned to obtain two or more second matrices, or the left-hand matrix is partitioned to obtain two or more second matrices and the right-hand matrix is partitioned to obtain two or more first matrices.

Clause A6. The processor of clause A5,

the controller is further configured to calculate a product of the left-hand matrix and the right-hand matrix based on a product of the first matrix and the second matrix.

Clause A7. The processor of clause A5, comprising a plurality of sets of registers,

the controller is further configured to transpose more than two first matrices to obtain transposed matrices after the input matrices are partitioned;

the controller loads the transposed matrix and more than two second matrices into a plurality of groups of registers to be stacked and stored, wherein the transposed matrix and the second matrix at corresponding positions are stored in one group of registers;

before each time the elements in the transposed matrix or the second matrix are rolled once in the row direction or the column direction, the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result;

the controller also corrects the scrolling result after controlling the elements of the transposed matrix in a set of registers to scroll by one row or one column in the row direction or the column direction.

Clause A8. The processor of clause A7, correcting the scrolling result comprising:

wherein each block transposed matrix refers to the matrix obtained by transposing each block matrix after partitioning.

Clause A9. A method of matrix multiplication based on a matrix of processing elements, applied to a processor, the processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the method performing matrix multiplication on a first matrix and a second matrix, the method comprising:

Clause A10. The method of clause A9, wherein controlling the transposed matrix or the second matrix to scroll in a row direction or a column direction, controlling the processing element to multiply elements in the corresponding register to obtain element products, and summing the element products of the same row or column to obtain a first intermediate result comprises repeating the following process until the elements in the transposed matrix or the second matrix return to their unscrolled positions:

controlling the processing element to multiply the elements in the corresponding register to obtain element products, summing the element products of the same row or the same column to obtain a first intermediate result, and scrolling the transposed matrix or the second matrix in the matrix of processing elements by one row or one column in the row direction or the column direction.

Clause A11. The method of clause A9 or A10,

when the first matrix is a left-hand matrix and the second matrix is a right-hand matrix, controlling elements in the transposed matrix to scroll in the row direction or controlling elements in the second matrix to scroll in the row direction; controlling the processing element to multiply the elements in the corresponding register to obtain element products, and summing the element products in the same column to obtain a first intermediate result;

when the first matrix is a right-hand matrix and the second matrix is a left-hand matrix, controlling elements in the transposed matrix to scroll in the column direction or controlling elements in the second matrix to scroll in the column direction; controlling the processing element to multiply the elements in the corresponding register to obtain element products, and summing the element products of the same row to obtain a first intermediate result.

Clause A12. The method of clause A9 or A10, wherein processing the first intermediate result to obtain a product of the first matrix and the second matrix comprises:

and storing the first intermediate result in rows or columns, and rolling in the row direction or the column direction to obtain the product of the first matrix and the second matrix.

Clause A13. The method of any of clauses A9-A12, the method further comprising:

determining whether to partition an input matrix according to the arrangement of the processing elements and the row rank and the column rank of the input matrix, wherein the input matrix comprises a left-hand matrix and a right-hand matrix;

if one of the input matrices is to be partitioned, splitting the rows of the left-hand matrix or splitting the columns of the right-hand matrix according to the arrangement of the processing elements;

if both of the input matrices are to be partitioned, partitioning the left-hand matrix in the column direction and the right-hand matrix in the row direction in the same manner, according to the arrangement of the processing elements and the row rank and the column rank of the input matrices;

Clause A14. The method of clause A13, the method further comprising:

the product of the left-hand matrix and the right-hand matrix is calculated from the product of the first matrix and the second matrix.

Clause A15. The method of clause A13, wherein the processor comprises a plurality of sets of registers,

the method further comprising:

after the input matrix is segmented, more than two first matrices are transposed to obtain transposed matrices;

stacking and storing the transposed matrix and more than two second matrices in the plurality of groups of registers, wherein the transposed matrix and the second matrix at corresponding positions are stored in one group of registers;

before each time the elements in the transposed matrix or the second matrix are rolled once in the row direction or the column direction, the processing element is controlled to multiply the elements in the corresponding register to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result;

after controlling the elements of the transposed matrix in a set of registers to scroll by one row or one column in the row direction or the column direction, the scrolling result is corrected.

Clause A16. The method of clause A15, wherein correcting the scrolling result comprises:

Clause A17. An artificial intelligence chip comprising a processor as set forth in any of clauses A1-A8.

Clause A18. An electronic device comprising the artificial intelligence chip of clause A17.

The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present disclosure; the description of the above examples is only intended to help understand the method of the present disclosure and its core ideas. Meanwhile, modifications or variations made by those skilled in the art to the specific embodiments and the scope of application, based on the ideas of the present disclosure, fall within the scope of protection of the present disclosure. In view of the foregoing, this description should not be construed as limiting the disclosure.

Claims

1. A processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the processor being configured to perform a matrix multiplication operation on a first matrix and a second matrix,

the controller is further used for processing the first intermediate result to obtain a product of a first matrix and a second matrix;

the processor comprises a plurality of groups of registers, and the controller is further used for transposing two or more first matrices to obtain transposed matrices after the input matrices are partitioned; the controller loads the transposed matrices and two or more second matrices into the plurality of groups of registers for stacked storage, wherein the transposed matrix and the second matrix at corresponding positions are stored in one group of registers; each time before the elements in the transposed matrix or the second matrix are scrolled once in the row direction or the column direction, the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result; the controller also corrects the scrolling result after controlling the elements of the transposed matrix in a group of registers to scroll by one row or one column in the row direction or the column direction.

2. The processor of claim 1, wherein the processor further comprises a processor controller,

3. A processor according to claim 1 or 2, wherein,

4. A processor according to claim 1 or 2, wherein,

5. The processor of claim 1, wherein the controller is further configured to determine whether to partition the input matrix into blocks based on the arrangement of the processing elements and a row rank and a column rank of the input matrix, wherein the input matrix comprises a left-multiplication matrix and a right-multiplication matrix;

6. The processor of claim 5, wherein the processor further comprises,

7. The processor of claim 1, wherein correcting the scrolling result comprises:

8. A method of matrix multiplication based on a matrix of processing elements, applied to a processor, the processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the method implementing a matrix multiplication operation on a first matrix and a second matrix, the method comprising:

processing the first intermediate result to obtain the product of the first matrix and the second matrix;

the processor comprises a plurality of groups of registers, the method further comprising: after the input matrix is partitioned into blocks, transposing more than two first matrices to obtain transposed matrices; storing the transposed matrices and more than two second matrices in the plurality of groups of registers in a stacked manner, wherein the transposed matrix and the second matrix at corresponding positions are stored in one group of registers; each time before the elements in the transposed matrix or the second matrix are scrolled once in the row direction or the column direction, controlling the processing elements to multiply the elements in the corresponding registers to obtain element products, and summing the element products in the same row or the same column to obtain a first intermediate result; and after controlling the elements in one group of registers to scroll by one row or one column of the transposed matrix in the row direction or the column direction, modifying the scrolling result.

9. The method according to claim 8, wherein controlling the transposed matrix or the second matrix to scroll in the row direction or the column direction, controlling the processing elements to multiply the elements in the corresponding registers to obtain element products, and summing the element products in the same row or the same column to obtain the first intermediate result comprises repeating the following process until the elements in the transposed matrix or the second matrix return to their unscrolled positions:

10. The method according to claim 8 or 9, wherein,

11. The method according to claim 8 or 9, wherein processing the first intermediate result to obtain the product of the first matrix and the second matrix comprises:

12. The method of claim 8, wherein the method further comprises:

13. The method according to claim 12, wherein the method further comprises:

14. The method of claim 8, wherein modifying the scrolling result comprises:

15. An artificial intelligence chip, characterized in that the chip comprises a processor according to any one of claims 1-7.

16. An electronic device comprising the artificial intelligence chip of claim 15.
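
The multiply, sum and scroll procedure recited in claims 1 and 8 can be illustrated in software. The following Python sketch is one possible reading under stated assumptions and is not the claimed hardware implementation: it assumes square n x n matrices that fit the processing-element array without partitioning, a single group of registers, column-wise summation, and scrolling of the second matrix only, so no correction of the scrolling result is modelled. The function names `roll_columns` and `matmul_by_rolling` are hypothetical.

```python
# Illustrative sketch only; all names and the square-matrix restriction are assumptions.

def roll_columns(matrix, shift=1):
    """Cyclically shift each row left, so column j receives the old column j + shift."""
    return [row[shift:] + row[:shift] for row in matrix]


def matmul_by_rolling(first, second):
    """Compute first @ second for square matrices by multiply, column-sum and scroll."""
    n = len(first)
    # Transposed first matrix: transposed[i][j] == first[j][i].
    transposed = [[first[j][i] for j in range(n)] for i in range(n)]
    # Copy of the second matrix that is scrolled once per step.
    rolled = [row[:] for row in second]
    product = [[0] * n for _ in range(n)]

    for step in range(n):
        # Each processing element (i, j) multiplies the two values it holds.
        element_products = [
            [transposed[i][j] * rolled[i][j] for j in range(n)]
            for i in range(n)
        ]
        # Summing column j gives one product element (the "first intermediate result").
        for j in range(n):
            column_sum = sum(element_products[i][j] for i in range(n))
            product[j][(j + step) % n] = column_sum
        # Scroll the second matrix by one column; after n steps it is back in place.
        rolled = roll_columns(rolled)

    return product


if __name__ == "__main__":
    print(matmul_by_rolling([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In this sketch, the column sum at step s in column j equals element (j, (j + s) mod n) of the product, so n steps fill the whole result and leave the scrolled matrix in its original position, which corresponds to the stopping condition in claim 9.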