CN117501250A - Method and apparatus for matrix computation acceleration - Google Patents


Info

Publication number
CN117501250A
Authority
CN
China
Prior art keywords
matrix
format
data
memory
matrix data
Legal status
Pending
Application number
CN202280042065.5A
Other languages
Chinese (zh)
Inventor
叶子纯
愷婷艾米·王
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN117501250A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/10 Program control for peripheral devices
    • G06F13/12 Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Abstract

The present invention provides a method and apparatus for matrix multiplication acceleration. DMA operations move matrix data from host memory to accelerator memory and vice versa. The DMA operations also rearrange the matrix data into a suitable two-dimensional or four-dimensional format. The accelerator may perform multiplication on different parts of the matrices at a time. In some cases, the DMA operations may swap the order of the two matrices being multiplied in order to obtain a transpose of the multiplication result.

Description

Method and apparatus for matrix computation acceleration
Cross reference to related applications
The present application claims the benefit of U.S. provisional application No. 63/247,029 filed on month 22 of 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates generally to the field of numerical computation, and more particularly to a method and apparatus for performing matrix multiplication using computer means.
Background
Matrix multiplication, such as general matrix multiplication (GEMM), is a computing task of great significance in fields such as artificial intelligence (AI), high performance computing (HPC), and scientific computing. Because matrix multiplication is computationally intensive with respect to matrix size, it has become a design trend to use dedicated hardware, such as matrix multiplication units, to accelerate it. Examples of suitable hardware include on-chip matrix multiplication units, the tensor cores of graphics processing units (GPUs), and the matrix units of tensor processing units (TPUs).
To speed up matrix multiplication computations, some specialized AI hardware operates on blocks of input data rather than on a single memory location or a pair of memory locations at a time. A large matrix multiplication computation is then broken down into smaller computations on blocks. This can be particularly beneficial when the matrices are large. By breaking the multiplication into small blocks, the blocked input and output data can be held in the limited buffer space near the processor core. Each tile matrix being multiplied is typically square, e.g., a 16-row by 16-column matrix, where each element belongs to a particular data type (e.g., single-precision or half-precision floating point). Such a tile matrix is also referred to herein as a fractal, but other terms related to matrix block decomposition and block matrix multiplication are also applicable (e.g., a fractal may be referred to as a block or sub-matrix). Multiplication by block matrix multiplication is shown in fig. 1 and is known in the art. Matrix A 105 is decomposed into a plurality of fractals A11, A12, A21, A22, and matrix B 110 is decomposed into a plurality of fractals (blocks) B11, B12, B21, B22. Although each matrix is shown decomposed into two rows and two columns of fractals, decomposition into other numbers of fractals is possible. The result of multiplying matrices A and B is equal to block matrix 115, where each block is equal to a sum of products of the blocks of matrices A and B, as shown. When the fractals are replaced with scalars, this "sum of products" is identical in form to the sum of products that occurs in conventional matrix multiplication.
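For illustration only (not part of the described embodiments), the block decomposition of fig. 1 can be sketched in a few lines of Python, assuming NumPy and square fractals that evenly divide the matrix dimensions; all names and sizes below are illustrative assumptions:

    import numpy as np

    def blocked_matmul(A, B, n_f):
        # Multiply A (M x K) by B (K x N) one n_f x n_f fractal at a time.
        M, K = A.shape
        K2, N = B.shape
        assert K == K2 and M % n_f == 0 and K % n_f == 0 and N % n_f == 0
        C = np.zeros((M, N), dtype=A.dtype)
        for i in range(0, M, n_f):            # row of fractals in A / C
            for j in range(0, N, n_f):        # column of fractals in B / C
                for k in range(0, K, n_f):    # "sum of products" of blocks
                    C[i:i+n_f, j:j+n_f] += A[i:i+n_f, k:k+n_f] @ B[k:k+n_f, j:j+n_f]
        return C

    # sanity check against the ordinary product
    A = np.arange(32 * 48, dtype=float).reshape(32, 48)
    B = np.arange(48 * 64, dtype=float).reshape(48, 64)
    assert np.allclose(blocked_matmul(A, B, 16), A @ B)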
Host computer memory (e.g., random access memory (RAM)) typically occupies a linear address space. This means that the host computer memory can be viewed as one large one-dimensional array. Because a matrix is inherently a two-dimensional array, the computer must map the two dimensions of the matrix onto the one dimension of main memory. There are two main families of formats for achieving this: the two-dimensional (2D) format and the four-dimensional (4D) format. The 2D format traverses the matrix in one direction, starting the next row (or column) at the end of the current row (or column). The 2D format may use row-major order or column-major order; the difference between the two is which elements of the array are contiguous in memory. In row-major order, consecutive elements of a given row are contiguous in memory; in column-major order, consecutive elements of a given column are contiguous in memory. If the size of the matrix is greater than the size of a fractal (e.g., the matrix includes more than one fractal), then in such a representation the data of a fractal within the matrix is not contiguous in memory. Fig. 2 shows an example of the 2D format in row-major order and in column-major order. In fig. 2, a matrix 200, with rows indexed by index variable i and columns indexed by index variable j, includes elements represented by the letters A, B, C, .... The number of rows is denoted nrows (=4) and the number of columns is denoted ncolumns (=3). The contents of main memory holding the matrix in row-major order are shown at 220. In this case, consecutive row-wise matrix elements are stored in adjacent memory locations, and adjacent rows are stored in adjacent, consecutive memory blocks. The contents of main memory holding the matrix in column-major order are shown at 230. In this case, consecutive column-wise matrix elements are stored in adjacent memory locations, and adjacent columns are stored in adjacent, consecutive memory blocks.
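As a simple illustration of the two 2D orders (a restatement of the address arithmetic only, not part of the embodiments), the offset of element (i, j) in linear memory can be written as follows; indices are zero-based:

    def rowmajor_offset(i, j, nrows, ncols):
        # element (i, j) of an nrows x ncols matrix stored in row-major order
        return i * ncols + j

    def colmajor_offset(i, j, nrows, ncols):
        # element (i, j) of an nrows x ncols matrix stored in column-major order
        return j * nrows + i

    # For the 4-row x 3-column matrix of fig. 2, element (2, 1) lands at
    # offset 2*3 + 1 = 7 in row-major order and 1*4 + 2 = 6 in column-major order.
    assert rowmajor_offset(2, 1, 4, 3) == 7
    assert colmajor_offset(2, 1, 4, 3) == 6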
The 4D format (also referred to as the "fractal format") assumes that the matrix is divided into a plurality of fractals. The 4D format first traverses the matrix inside each fractal and then traverses the matrix between fractals. In the 4D format, the data is no longer completely contiguous in the manner shown in fig. 2. Instead, the data is contiguous first inside each block and then between blocks. In the 4D format there are four possible orders for traversing the matrix: the order within a fractal may be row-major or column-major, and the order between fractals may be row-major or column-major. Fig. 3 shows an example of a 4D format that is row-major both inside and between fractals. Specifically, fig. 3 shows a matrix comprising four rows of fractals (blocks) and four columns of fractals (blocks). Each fractal itself comprises 16 rows and 16 columns. Line 330 is shown to represent the traversal of the matrix elements from top left to bottom right such that the matrix elements are contiguous in computer memory. (Note that the "Z" shape drawn within each matrix fractal is a simplification; each matrix fractal actually has 16 horizontal lines instead of the two shown.) Thus, line 330 traverses the first row of the first fractal, then the second row of the first fractal, and so on until all rows of the first fractal have been traversed; the line then traverses the rows of the second fractal in the same manner, then the rows of the third fractal, and so on until all 16 fractals have been traversed. The first fractal is the fractal in the first row and first column, the second fractal is the fractal in the first row and second column, the fifth fractal is the fractal in the second row and first column, and so on. It should be noted that fig. 2 may also illustrate the order of the contents of a fractal in memory. Further, if each letter element in fig. 2 is understood as a block of data, fig. 2 may illustrate the order in which multiple fractals are stored in memory.
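A minimal sketch of the zZ linearization of fig. 3, assuming NumPy and a matrix whose dimensions are multiples of the fractal size (illustrative only; the helper name is not taken from the embodiments):

    import numpy as np

    def to_zZ(M, n_f):
        # Linearize a 2D matrix into 4D zZ order: row-major over the grid of
        # fractals (outer "Z") and row-major inside each fractal (inner "z").
        rows, cols = M.shape
        assert rows % n_f == 0 and cols % n_f == 0
        out = []
        for bi in range(0, rows, n_f):
            for bj in range(0, cols, n_f):
                out.extend(M[bi:bi+n_f, bj:bj+n_f].flatten(order='C'))
        return np.array(out)

    M = np.arange(64 * 64).reshape(64, 64)
    lin = to_zZ(M, 16)
    # the first 256 entries are exactly the first 16 x 16 fractal, row by row
    assert np.array_equal(lin[:256], M[:16, :16].flatten())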
The 2D format is a natural way of storing matrices as two-dimensional arrays, with matrix data typically stored in host computer memory in such a 2D format. This also facilitates access to matrix data and resizing of matrices in a simple manner. However, when computation of matrix multiplication is involved, some hardware accelerators require matrix data to be provided in 4D format. To achieve this, the conventional solution is to transform the entire matrix from 2D format to 4D format, then provide it to the accelerator for computation, then receive the result, and then transform the result back to 2D format. This transformation operation may be implemented by calculation at the host computer side or by using an additional data format transformation kernel at the hardware accelerator side. However, this method requires a lot of additional computational effort and time to transform the data format, and also requires additional storage to store the transformed results. Furthermore, in some applications, this method of transforming the format of the entire matrix is significantly inefficient when it is desired to read or write sub-matrices from or to a larger matrix, especially when a series of successive matrix multiplications are performed in order to support real-time applications and the like.
Accordingly, there is a need for a computing method and apparatus for performing and supporting matrix multiplication that obviates or mitigates one or more of the limitations of the prior art.
The purpose of the background is to reveal information believed by the applicant to be of possible relevance to the present invention. No admission is made that any of the above information constitutes prior art against the present invention.
Disclosure of Invention
An object of embodiments of the present invention is to provide a method and apparatus for matrix multiplication acceleration. In this regard, embodiments of the present invention provide for the processing of one or more data layouts stored in a computer memory, where each data layout stores a matrix (or set of matrices) in one or more specific formats. The above-described processing includes direct memory access (DMA) operations for moving matrix data to and from a hardware accelerator device (referred to herein simply as an "accelerator") that performs matrix multiplication and returns the results of the matrix multiplication. The DMA operations move data values from the source memory to the target memory and are performed such that the format in which the matrix (or matrices) is stored is changed as circumstances require. This may be accomplished by having the DMA operations move data elements from source memory locations to target memory locations according to a set of specific rules. Accordingly, embodiments of the present invention provide a method of transforming a matrix format by direct memory access operations in order to accelerate matrix multiplication. Thus, format conversion and DMA operations may be combined. This approach is in sharp contrast to prior art schemes in which matrix format conversion is performed separately from (e.g., before or after) providing matrix data to a multiplier (e.g., an accelerator).
A DMA operation moves matrix data from a source memory to a target memory. Transforming the matrix format means that the format of one or more matrices, or of portions of one or more matrices, differs in the target memory from the format in the source memory. Examples of different formats include the 2D and 4D formats and the different types of 2D and 4D formats. The types of the 2D format include a row-major type (denoted Z) and a column-major type (denoted N). The types of the 4D format include: row-major both between and within fractals (denoted zZ), column-major both between and within fractals (denoted nN), row-major between fractals with column-major within fractals (denoted nZ), and column-major between fractals with row-major within fractals (denoted zN). (The letters "N" and "Z" are used because their shapes trace the order in which the elements of a matrix (or fractal) are visited as memory is traversed sequentially. Lowercase letters refer to the format within a fractal, and uppercase letters refer to the format between fractals.)
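The four 4D orders can be captured in one illustrative helper (a sketch assuming NumPy; the function and flag names are not taken from the embodiments):

    import numpy as np

    def to_4d(M, n_f, inner='z', outer='Z'):
        # inner 'z'/'n': row-/column-major inside each fractal;
        # outer 'Z'/'N': row-/column-major over the grid of fractals.
        rows, cols = M.shape
        blocks = [(bi, bj) for bi in range(0, rows, n_f)
                           for bj in range(0, cols, n_f)]
        if outer == 'N':
            blocks = [(bi, bj) for bj in range(0, cols, n_f)
                               for bi in range(0, rows, n_f)]
        order = 'C' if inner == 'z' else 'F'
        return np.concatenate([M[bi:bi+n_f, bj:bj+n_f].flatten(order=order)
                               for bi, bj in blocks])

    M = np.arange(32 * 32).reshape(32, 32)
    # nZ: row-major between fractals, column-major within each fractal
    assert np.array_equal(to_4d(M, 16, inner='n', outer='Z')[:256],
                          M[:16, :16].flatten(order='F'))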
In the discussion that follows, matrix multiplication is considered to be performed by an accelerator that requires matrix data to be in 4D format. An accelerator is typically a specialized hardware device within, or operatively coupled to, a host computer (also referred to herein simply as a "host"). The host may be a personal computer (e.g., an x86 device) or another type of computing device. The accelerator may be provided as a chip or chipset permanently installed in the host, or coupled to the host through an expansion slot or other suitable internal or external data bus, or the like. The accelerator performs matrix multiplication at the fractal level; that is, each matrix is treated as comprising a plurality of blocks, and the different blocks are multiplied, as conceptually shown in fig. 1. The two input matrices to be multiplied and the result matrix are each divided into a plurality of fractals. In each calculation step of the matrix multiplication, the accelerator accepts as input a row of fractals of the left input matrix and a column of fractals of the right input matrix, and calculates one of the fractals of the result matrix from these inputs. The left input matrix is the left matrix in a (standard or generalized) matrix multiplication operation, and the right input matrix is the right matrix in that operation.
According to an embodiment of the present invention, the operation of moving data from a host or associated memory to an accelerator or associated memory is combined with the operation of properly formatting (in 4D format) the matrix at the destination of the data movement, rather than transforming the matrix format before starting the matrix multiplication computation (using the accelerator). Thus, the matrix format is transformed by the DMA operation. For example, each time a row of fractals is moved from the left input matrix by DMA, or a column of fractals is moved from the right input matrix by DMA, the required format conversion may be accomplished inherently by setting parameters in the DMA commands. The term "move" as used herein may refer to moving or copying data. These commands may cause the DMA operation to move the data from each specified source location to a specified appropriate destination location, which may result in the data being rearranged in a manner that achieves the desired format conversion. The left input matrix is also called the multiplier, and the right input matrix is also called the multiplicand.
According to an embodiment of the invention, the left input matrix and the right input matrix of the multiplication are interchanged when the result matrix is required to be in a transposed format, in addition to or instead of a format transformation within the matrices. This exchange may be used to cause the result matrix to take on the desired transposed format. (It will be readily appreciated that in the transposed format, rows of the matrix are represented as columns, and vice versa.) In some embodiments, the exchange is used together with the other format conversion techniques described above. Matrix swapping may be implemented by DMA operations that move data in a particular way. These embodiments may be used to avoid or reduce the need for separate matrix transpose operations, thereby increasing computational efficiency.
According to an embodiment of the invention, a method is provided. The method comprises the following steps: matrix data is migrated from a source memory to a target memory using one or more data migration operations. The data shifting operation is used for rearranging the matrix data from a first format to a second format. The first format is a format in which the matrix data is stored in the source memory. The second format is a format requiring the matrix data to be stored in the target memory.
In some embodiments, the source memory is memory in a host computing device and the target memory is memory in a hardware accelerator device for performing matrix multiplication. In these embodiments, the method further comprises: the hardware accelerator device performs one or more matrix multiplication operations on the matrix data in the target memory to produce output matrix data. In some other embodiments, the method further comprises: the output matrix data is moved from memory in the hardware accelerator device to memory in the host computing device using one or more second data movement operations.
In some further embodiments, the second data moving operations are used to reorder the output matrix data from a third format to a fourth format. The third format is the format in which the matrix data is provided by the hardware accelerator device. The fourth format is the format in which the output matrix data is required to be stored in memory in the host computing device.
In other further embodiments, the first data moving operations and the second data moving operations are used jointly to reorder the output matrix data from a third format to a fourth format. Again, the third format is the format in which the matrix data is provided by the hardware accelerator device, and the fourth format is the format in which the output matrix data is required to be stored in memory in the host computing device. In some other embodiments, moving the matrix data from the source memory to the target memory includes moving data corresponding to a left matrix and moving data corresponding to a right matrix, where the left matrix and the right matrix are to be multiplied; and using the first data moving operations and the second data moving operations jointly to reorder the output matrix data from the third format to the fourth format includes exchanging the data corresponding to the left matrix with the data corresponding to the right matrix to obtain the transpose of the output matrix.
According to an embodiment of the present invention, an apparatus is provided. The apparatus comprises: a host matrix memory, an accelerator matrix memory, a hardware accelerator device, and a direct memory access manager. The host matrix memory is configured to store matrix data in a first format. The accelerator matrix memory is configured to store the matrix data in a second format. The hardware accelerator device is configured to perform matrix multiplication based on the matrix data stored in the accelerator matrix memory. The direct memory access manager is configured to move the matrix data from the host matrix memory to the accelerator matrix memory using one or more data moving operations. The data moving operations rearrange the matrix data from the first format to the second format. The first format is the format in which the matrix data is stored in the host matrix memory. The second format is the format in which the matrix data is required to be stored in the accelerator matrix memory. The apparatus may be used to implement embodiments of the method described above.
According to an embodiment of the invention, a method is provided. The method comprises the following steps: the receiving hardware accelerator performs an instruction of a matrix multiplication operation using input matrix data stored in a first format in a host memory as input. The method further comprises the steps of: determining that the output of the matrix multiplication operation is stored in the host memory in a second format; and moving the input matrix data stored in the host memory to a hardware accelerator memory. The method further comprises the steps of: reformatting the input matrix data from the first format to a third format, wherein the third format is a four-dimensional (4D) format required by the hardware accelerator memory to store the input matrix data for the matrix multiplication operation. The method further comprises the steps of: the hardware accelerator performs the matrix multiplication operation on the input matrix data reformatted into the third format stored in the hardware accelerator memory to produce result matrix data in a fourth format, wherein the fourth format is a 4D format required by the hardware accelerator memory to store the result matrix data. Furthermore, the method comprises: reformatting the result matrix data into an order compatible with the second format; and moving the reformatted result matrix data from the hardware accelerator memory to the host memory for storage as an output of the matrix multiplication operation.
In some embodiments, the first format designates the order of the data as row-major or column-major, and the second format designates the order of the data as row-major or column-major.
In some embodiments, the second format is different from the first format.
In some embodiments, reformatting the input matrix data from the first format to the third format includes arranging the input matrix data into one or more fractals, wherein the third format specifies the order of the input matrix data within the fractals and the order between the fractals according to the format required by the hardware accelerator memory. In other embodiments, the third format specifies the order of data within the fractals as row-major or column-major and specifies the order between the fractals as row-major or column-major.
In some embodiments, the reformatting the input matrix data from the first format to a third format further comprises: the input matrix data is transposed within at least some of the fractal according to a format required for the input matrix data to be stored in the hardware accelerator memory.
In further embodiments, reformatting the input matrix data from the first format to the third format includes: determining left matrix data and right matrix data to be multiplied as part of the matrix multiplication operation; determining that the order of data within a fractal of the result matrix data generated in the fourth format by multiplying the left matrix data with the right matrix data corresponds to a transpose of the second format; and performing the matrix multiplication with the left matrix data and the right matrix data exchanged, to avoid having to transpose the data within the fractals of the result matrix data. Reformatting the result matrix data into an order compatible with the second format then includes preserving the order of the data within the fractals of the result matrix data. In some other embodiments, exchanging the left matrix data with the right matrix data includes transposing data in at least one of the left matrix data and the right matrix data.
In some embodiments, the reformatting the result matrix data from the fourth format to the second format further comprises: at least some of the result matrix data is transposed.
In some embodiments, the instruction to perform a matrix multiplication operation is received from an application, the method further comprising: the application defines the first format and defines the second format independent of the first format.
According to an embodiment of the invention, a method is provided. The method comprises: receiving an instruction for a hardware accelerator to perform a matrix multiplication operation using, as input, input matrix data stored in a first format in a host memory; determining that the output of the matrix multiplication operation is to be stored in the host memory in a second format; and moving the input matrix data stored in the host memory to a hardware accelerator memory. The method further comprises: determining that the format of a result matrix generated by multiplying left matrix data and right matrix data as part of the matrix multiplication operation corresponds to a transpose of the second format; and reformatting the matrix data from the first format to a third format by exchanging the left matrix data with the right matrix data for the matrix multiplication operation. The method further comprises: performing, by the hardware accelerator, the matrix multiplication operation on the input matrix data reformatted into the third format and stored in the hardware accelerator memory, to produce result matrix data corresponding to the second format; and moving the result matrix data from the hardware accelerator memory to the host memory to be stored as the output of the matrix multiplication operation.
In some embodiments, said exchanging said left matrix data with said right matrix data comprises: and suppressing data transposition in at least one of the left matrix data and the right matrix data to obtain an optimized third format.
The embodiment of the invention can be applied to various situations. For example, in some embodiments, embodiments may be applied to an artificial intelligence (artificial intelligence, AI) device (e.g., a chip) that includes a matrix multiplication accelerator that requires the matrix to be in a particular (e.g., 4D) format.
Embodiments are described above in connection with various aspects of the invention, and may be implemented based on these aspects. Those skilled in the art will appreciate that embodiments may be implemented in conjunction with the aspect in which they are described, but may also be implemented with other embodiments of that aspect, except where embodiments are mutually exclusive or otherwise incompatible with each other. Some embodiments may be described in connection with one aspect but may be adapted for use with other aspects, as will be apparent to those skilled in the art.
Drawings
Other features and advantages of the present invention will become apparent from the following detailed description of the invention, taken in conjunction with the accompanying drawings.
Fig. 1 illustrates block matrix multiplication according to the prior art.
Fig. 2 illustrates the storage of matrix data in 2D row-major and column-major formats according to the prior art.
Fig. 3 illustrates the storage of matrix data in the 4D zZ format according to the prior art.
FIG. 4 generally illustrates rearranging matrix data between a source memory and a target memory during DMA transfers provided by one embodiment.
Fig. 5A and 5B illustrate a portion of a matrix multiplication operation using fractal as provided by one embodiment.
FIG. 6 illustrates one example of DMA movement of matrix data to an accelerator provided by one embodiment.
FIG. 7 illustrates an example of DMA movement of matrix data to an accelerator provided by another embodiment.
FIG. 8 illustrates one example of DMA movement of matrix data from an accelerator provided by one embodiment.
Fig. 9A and 9B illustrate one example, provided by one embodiment, of swapping the order of the left and right matrices as matrix data is moved to the accelerator, in order to obtain a transpose of the matrix data provided by the accelerator.
FIG. 10 illustrates interaction of a host computer and an accelerator device provided by one embodiment.
FIG. 11 illustrates an apparatus for performing matrix multiplication provided by one embodiment.
FIG. 12 illustrates the format of the multipliers, multiplicands, and result matrices provided by one exemplary embodiment.
It should be noted that throughout the appended drawings, like features are identified by like reference numerals.
Detailed Description
The embodiment of the invention relates to a matrix multiplication acceleration method and a matrix multiplication acceleration device. The method includes performing a DMA operation to move matrix data from host memory to accelerator-accessible memory and/or from accelerator-accessible memory to host memory. The DMA operation moves the matrix data while also reordering or reformatting the matrix data into a specified format suitable for the next operation. In the case of moving to an accelerator (or associated memory) for processing by the accelerator, such operations are typically matrix multiplication operations involving multiplication of matrix blocks (fractal), the format being typically a 4D format of a specified type. In some cases, the DMA operations may swap the order of the two matrices to facilitate the matrix transpose operation.
Fig. 4 generally illustrates one aspect of the present invention. A first memory 410 holds matrix data, and a second memory 420 is designated to hold the same matrix data in a different format. For example, the first memory 410 may initially hold data for a pair of matrices to be multiplied by the accelerator, and the first memory may be accessible to the host (but not necessarily to the accelerator). Further, in this example, the second memory 420 may be designated for holding the same data in a different format, may be accessible to the accelerator, and may also be architecturally close to the accelerator. One or more DMA operations 430 move data from the first memory 410 to the second memory 420. Notably, the DMA operations reorder (reformat) the data, and are therefore shown by arced arrows. The reordering (reformatting) is performed in such a way that the matrix format is changed in a specified manner (e.g., from 2D format to 4D format) and/or two matrices are swapped (the multiplier becomes the multiplicand, and vice versa). It should be noted that not all arrows are shown; typically the contents of each associated memory location are moved. It should also be noted that the beginnings and ends of the arrows in fig. 4 do not necessarily have any particular meaning, except to illustrate that the rearrangement occurs in some form. The rearrangement itself (corresponding to the beginnings and ends of the arrows) may be configured according to given operating parameters 440. The operating parameters may be based on hardware requirements and constraints, the formats required by the host and the accelerator, application requirements, and the like.
Accordingly, embodiments of the present invention relate to methods and apparatus for moving matrix data between memories by DMA or similar operations, such that the matrix data is also rearranged (reformatted) and such that the matrix data is provided to and/or from an accelerator performing matrix multiplication. Reordering (reformatting) enables a hardware accelerator (hardware accelerator device) to properly perform the desired matrix multiplication operations, the output of which is provided to the host in a proper format (a second reordering (reformatting) may be required during the move back to the host). This typically requires that the matrix data provided to the accelerator be in a specific (e.g., 4D) format. The output may be provided to the host in a suitable format via DMA. DMA move operations from host to accelerator and/or from accelerator to host may be used to achieve the appropriate data reordering.
According to an embodiment of the present invention, as shown in fig. 5A and 5B, the two input matrices 510, 520 to be multiplied and the result matrix 530 (representing the multiplication result) are each divided into a plurality of fractals, e.g., fractal 512. In each calculation step of the matrix multiplication, a row of fractals 514 of the left input matrix 510 and a column of fractals 524 of the right input matrix 520 are used to calculate one of the fractals 534 of the result matrix 530. Thus, each calculation step includes three parts. First, for each input matrix 510, 520, one stripe is moved to the accelerator by a DMA operation. A stripe refers to a row of fractals or a column of fractals, depending on the matrix. Second, the accelerator performs a matrix multiplication operation on the two moved stripes and obtains the corresponding fractal of the result matrix 530. Third, the obtained fractal is moved back (e.g., via DMA) to the memory location designated for storing the result matrix. This process may be repeated as necessary to obtain each required fractal of the result matrix 530. Fig. 5B shows the same situation as fig. 5A, but for the special case of 2 x 2 fractals. Each matrix element is labeled with a value "xij", where "x" denotes the matrix (A, B or C), i denotes the matrix row number, and j denotes the matrix column number. Here and elsewhere, each 2 x 2 fractal is marked with cross-hatching, and adjacent 2 x 2 fractals are marked with different types of cross-hatching.
Since matrices A, B and C are all column-major, the data order of matrix A in host memory is a11, a21, a31, ..., a81, a12, a22, a32, ..., a82, ..., a18, a28, a38, ..., a88. The data orders of matrices B and C in host memory are similar. For the fractals shown, the order of the data (in host memory) is as follows. For the fractals shown in matrix A: a31, a41, (skipping 6 elements of A), a32, a42, (skipping 6 elements of A), a33, a43, (skipping 6 elements of A), a34, a44, .... For the fractals shown in matrix B: b13, b23, b33, b43, ..., b83, b14, b24, ..., b84. For the fractals shown in matrix C: c33, c43, (skipping 6 elements of C), c34, c44.
Implementation details according to some embodiments are provided below. It is assumed in these embodiments that a DMA instruction includes the following parameters. A source pointer address is provided, denoted "src". The source pointer address indicates the starting position of the memory block from which the matrix data is to be moved. A target pointer address is provided, denoted "dst". The target pointer address indicates the starting position of the memory block to which the matrix data is to be moved. A length parameter is provided, denoted "len". The length parameter indicates the length of the data to be moved by each DMA sub-operation (starting from src for the first sub-operation), e.g., expressed as a number of memory locations. A repetition parameter is provided, denoted "repeat". The repetition parameter indicates the number of sub-operations to be performed; each DMA sub-operation is considered a repetition of the same basic operation, with the other parameters of each sub-operation adjusted as appropriate. A first stride parameter may be provided, denoted "src_stride". The first stride parameter indicates the number of memory locations in the source memory to be skipped between DMA sub-operations. That is, if one DMA sub-operation copies data from the source memory up to memory location m, the next DMA sub-operation reads from the same source memory starting at memory location m + src_stride. A second stride parameter may be provided, denoted "dst_stride". The second stride parameter indicates the number of memory locations in the target memory to be skipped between DMA sub-operations. That is, if one DMA sub-operation writes data into the target memory up to memory location m, the next DMA sub-operation writes to the same target memory starting at memory location m + dst_stride. Together, the DMA sub-operations constitute the entire DMA operation.
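A software model of a DMA operation with these parameters may look as follows (an illustrative sketch only; the parameter semantics follow the description above, with src_stride and dst_stride counted as locations skipped between sub-operations, and the toy buffers are assumptions):

    def dma_move(src_mem, dst_mem, src, dst, length, repeat, src_stride, dst_stride):
        # "repeat" sub-operations, each copying "length" contiguous elements;
        # between sub-operations the source skips src_stride extra locations
        # and the destination skips dst_stride extra locations.
        s, d = src, dst
        for _ in range(repeat):
            dst_mem[d:d + length] = src_mem[s:s + length]
            s += length + src_stride
            d += length + dst_stride

    # toy example: gather a stripe of width 2 out of a flat buffer
    host = list(range(24))
    accel = [0] * 8
    dma_move(host, accel, src=4, dst=0, length=2, repeat=4, src_stride=4, dst_stride=0)
    assert accel == [4, 5, 10, 11, 16, 17, 22, 23]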
For simplicity, the shape of each fractal (matrix block) is taken to be square, i.e., the number of rows equals the number of columns. The side length of each fractal is denoted N_f, indicating the number of rows (or columns) in the fractal. For further simplicity, the input matrices (in host memory) are taken to be in 2D format (row-major or column-major type), while the matrices (in accelerator memory) are required to be in 4D format (zZ, nZ, zN or nN type).
One example of moving a stripe of an input matrix from host memory to accelerator memory is as follows. In this example, it is assumed that the direction of the stripe being moved (a row or a column of fractals) is the same as the storage order of the corresponding input matrix. For example, the stripe may be a row of fractals extracted from a matrix in row-major format, or the stripe may be a column of fractals extracted from a matrix in column-major format. It should be noted that in this case, each entire row or column within the stripe occupies a sequential set of memory locations. Further, for a row of fractals extracted from a matrix in row-major format, the target matrix in this example may be a matrix in 4D zZ format; for a column of fractals extracted from a matrix in column-major format, the target matrix in this example may be a matrix in 4D nZ format. The method is generic in that it covers moving the fractals of the target into either fractal layout (xZ or xN, where x is z or n).
In this example, the DMA operation is performed as a set of block_n DMA instructions, where "block_n" is the number of fractals in the stripe. The ith DMA instruction may include the following parameters: src is the starting address of the ith fractal of the stripe of the input matrix in host memory; dst is the starting address of the ith fractal in the target memory; len is equal to N_f; repeat is equal to N_f; src_stride is set to N - N_f, where N is the leading dimension of the input matrix. For example, if the input matrix is row-major, N is the number of matrix elements per row; if the input matrix is column-major, N is the number of matrix elements per column. The value of dst_stride is set to 0.
Fig. 6 shows the above example for the case of moving a column of fractals. The marked fractals 610 are the fractals to be moved. The bold boxes 620 (b53, b63, b73, b83, b14, b24) indicate the data elements skipped by a particular DMA instruction (i.e., i = 2) according to the value of src_stride. The other area 625 represents the remaining matrix data that is not part of the stripe being moved. It should be noted that, if necessary (not shown), the data may additionally be transposed, e.g., by transposing each fractal. The numbers in the fractals may be used to track the relative locations of the fractals in host memory and in target memory. Fractals 610 are the fractals before the move, and fractals 630 are the fractals after the move. The large arrow indicates the DMA operation.
In more detail, in one example, fig. 6 may illustrate moving a column of fractals from a matrix B stored in column-major format to the accelerator, such that the fractals adopt a format that is row-major within each fractal. In this case, the direction of the stripe (a column) is the same as the direction of the data layout (column-major). The column of fractals of matrix B may be loaded into the memory assigned to the left matrix (denoted X) in the accelerator memory. The format of matrix X may be the zZ format. In host memory the matrix is stored in column-major format, so the order of the data of the second fractal being moved (in host memory) is b33, b43, b34, b44. In the accelerator, the data within each fractal is interpreted as row-major, so the same sequence b33, b43, b34, b44 (in accelerator memory) is now read row by row. Thus, in this example, the fractal is transposed by the move. In this example, if the order of matrices A and B is swapped, the resulting fractals do not have to be transposed when moved back to host memory; without swapping A and B, the resulting fractals may need to be transposed when moved back to host memory.
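For the fig. 6 layout (an 8 x 8 column-major matrix with 2 x 2 fractals), the set of per-fractal DMA instructions described above might be generated as sketched below; the address arithmetic (fractals of the stripe spaced N_f apart in host memory and N_f x N_f apart in the accelerator buffer) is an assumption made for illustration, not a statement of the hardware interface:

    def stripe_to_accel_instrs(stripe_start, xbuf_start, n_f, block_n, leading_dim):
        # one DMA instruction per fractal of a stripe whose direction matches
        # the host storage order (e.g. a column of fractals of a column-major matrix)
        return [dict(src=stripe_start + i * n_f,        # assumed spacing in host memory
                     dst=xbuf_start + i * n_f * n_f,    # fractals assumed contiguous on the accelerator
                     length=n_f, repeat=n_f,
                     src_stride=leading_dim - n_f, dst_stride=0)
                for i in range(block_n)]

    # second column of fractals of the 8 x 8 matrix B of fig. 6 (columns 3-4)
    instrs = stripe_to_accel_instrs(stripe_start=16, xbuf_start=0,
                                    n_f=2, block_n=4, leading_dim=8)
    assert instrs[1]['src'] == 18          # fractal containing b33
    assert instrs[1]['src_stride'] == 6    # skips b53, b63, b73, b83, b14, b24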
Another example of moving a stripe of an input matrix from host memory to accelerator memory is as follows. In this example, it is assumed that the direction of the stripe being moved (a row of fractals or a column of fractals) is different from the storage order of the corresponding input matrix. For example, the stripe may be a row of fractals extracted from a matrix in column-major format, or the stripe may be a column of fractals extracted from a matrix in row-major format. For a row of fractals extracted from a matrix in column-major format, the target matrix in this example may be a matrix in 4D zZ format.
In this example, the DMA operation may be issued as a single DMA instruction. The DMA instruction may include the following parameters: src is the starting address, in host memory, of the stripe of the input matrix being moved; dst is the starting address of the stripe in the accelerator memory; len is equal to N_f; repeat is equal to block_n × N_f; src_stride is set to N - N_f, where N is the leading dimension of the input matrix, as described above. The value of dst_stride is set to 0.
Fig. 7 illustrates the above example. The marked boxes 710 are the fractals to be moved. Box 720 (a51, a61, a71, a81, a12, a22) indicates the data elements skipped by the DMA instruction (when moving from the first sub-operation to the second sub-operation) according to the value of src_stride. The other area 725 represents the remaining matrix data that is not part of the stripe being moved. Fractals 710 are the fractals before the move, and fractals 730 are the fractals after the move. The large arrow indicates the DMA operation. In some embodiments, a second DMA operation may be used to transform fractals 730 into fractals 740 while the data is being moved. The second DMA operation may move data from one accelerator memory location to another accelerator memory location. The second DMA operation may use a parameter configuration including source and target starting address parameters, the number of fractals to be moved, and a boolean flag indicating whether each fractal is to be transposed. If the flag is set, each fractal is transposed during the move; otherwise, the fractals are not transposed. Alternatively, multiple DMA operations may be combined into a composite DMA operation that moves data in the same manner as performing the two DMA operations sequentially.
It should be noted that, if necessary, the data may be transposed (as shown at 740), e.g., by transposing each fractal. In more detail, the format inside each fractal may need to be adjusted according to the format requirements of the accelerator. Each fractal may need to be transposed if the format required by the accelerator differs from the original 2D format of the input matrix. This may be required before moving the data to the matrix multiplication accelerator unit. This can occur, for example, when the input matrix is column-major (N) and the format required by the hardware within the fractal is row-major (zZ or zN). Such transposition may be performed, for example, using a second DMA operation that moves data from one accelerator memory location to another.
In more detail, in one example, fig. 7 may illustrate moving a row of fractals from a matrix A stored in column-major format to the accelerator, such that the fractals adopt the nZ format. In this case, the direction of the stripe (a row) is different from the direction of the data layout (column-major). The row of fractals of matrix A may be loaded into the memory assigned to the right matrix (denoted Y) in the accelerator memory. The format of matrix Y may be the nZ format. In host memory the matrix is stored in column-major format, so the order of the data of the second fractal being moved (in host memory) is a33, a43, a34, a44. In the accelerator, the matrix adopts the nZ format, so the data is column-major within each fractal; for example, the order of the data in the second fractal (in accelerator memory) is a33, a43, a34, a44. Thus, in this example, the fractal is not transposed by the move. In this example, if the order of matrices A and B is swapped, it is necessary to transpose the resulting fractals when they are moved back to host memory; without swapping A and B, there is no need to transpose the resulting fractals when they are moved back to host memory.
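An illustrative numerical check of the single-instruction move of fig. 7 (an 8 x 8 column-major matrix A, 2 x 2 fractals, second row of fractals), assuming NumPy; the loop below stands in for the DMA engine and the toy sizes are assumptions:

    import numpy as np

    n_f, block_n, leading_dim = 2, 4, 8
    A = np.arange(64).reshape(8, 8, order='F')      # column-major toy matrix
    host = A.flatten(order='F')
    accel = np.zeros(n_f * n_f * block_n, dtype=host.dtype)

    src, length, repeat = 2, n_f, block_n * n_f     # src = offset of a31 (row 3, column 1)
    src_stride, dst_stride = leading_dim - n_f, 0

    s, d = src, 0
    for _ in range(repeat):
        accel[d:d + length] = host[s:s + length]
        s += length + src_stride
        d += length + dst_stride

    # each fractal arrives column-major (a31, a41, a32, a42, ...); a second,
    # transposing DMA pass would be needed if the accelerator required it row-major
    assert np.array_equal(accel[:4], [A[2, 0], A[3, 0], A[2, 1], A[3, 1]])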
It should be noted that since only one stripe (a row or a column of fractals) is acquired at a time, the order between the fractals does not affect the order of the data. The order of the data in a row or column of fractals can therefore be represented by the same data sequence in (one-dimensional) memory. For example, stripes 740 and 742 are represented by the same data sequence in memory, i.e., the sequence a31, a32, a41, a42, a33, a34, a43, a44, a35, a36, a45, a46, a37, a38, a47, a48. Stripe 740 is a row of fractals in which each fractal is in row-major format; stripe 742 is a column of fractals in which each fractal is in column-major format.
One example of moving a fractal of the output matrix (calculated by the accelerator) back from the accelerator to host memory (into the location of the result matrix) is as follows. Here, the output of the accelerator is one fractal (matrix block). If the fractal has the same internal ordering as that required of the output matrix (e.g., the accelerator produces the fractal in zZ or zN format while the output matrix is in Z format), the DMA operation may be issued as a single DMA instruction. The DMA instruction may include the following parameters: src is the starting address, in accelerator memory, of the fractal of the output matrix being moved; dst is the starting address of the fractal within the output matrix as it is to appear in host memory; len is equal to N_f; repeat is equal to N_f; src_stride is set to 0; dst_stride is set to N - N_f.
Fig. 8 illustrates the above example (moving a fractal of the output matrix). The marked box 810 is the fractal to be moved, as it appears in accelerator memory. The marked block 815 is the same fractal after being moved to host memory. Block 820 indicates the data elements to be skipped by the DMA instruction according to the value of dst_stride. The other areas 825 represent the remaining matrix data that is not part of the fractal being moved. In this figure, the orientation of the fractal is changed (that is, a transposition is performed): a fractal that is row-major may be moved to host memory so that it appears within a matrix in column-major format.
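The write-back of one result fractal (fig. 8) can be modeled the same way; here dst_stride skips the remainder of each column of the host result matrix (an illustrative sketch assuming NumPy, a column-major host matrix, and toy sizes; not the hardware interface itself):

    import numpy as np

    n_f, leading_dim = 2, 8
    accel = np.arange(n_f * n_f)                 # one contiguous result fractal
    host = np.zeros(64, dtype=accel.dtype)       # flat, column-major 8 x 8 result matrix

    dst = 2 * leading_dim + 2                    # place the fractal at block row 1, block column 1
    length, repeat = n_f, n_f
    src_stride, dst_stride = 0, leading_dim - n_f

    s, d = 0, dst
    for _ in range(repeat):
        host[d:d + length] = accel[s:s + length]
        s += length + src_stride
        d += length + dst_stride

    C = host.reshape(8, 8, order='F')
    assert np.array_equal(C[2:4, 2:4].flatten(order='F'), accel)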
In some embodiments, certain operations described below may be required in order to move a fractal of the output matrix back into host memory, i.e., in order to make it take on the particular format required by the host. Performing these operations causes the result matrix to appear in a transposed format after the DMA move. To this end, the left and right input matrices of the multiplication may be interchanged when needed. For example, these operations may be performed if the fractals of the output matrix (as shown in fig. 8, etc.) are in column-major format and the format required by the host is row-major.
In some embodiments, the operations described in connection with fig. 8 may be extended as follows in order to cause the fractals of the output matrix to take on the required format. It should be noted that the operation described in connection with fig. 8 itself is not changed; rather, the adjustments apply to the preceding operations that move the fractals to the accelerator memory for multiplication. First, the left input matrix (multiplier) provided to the accelerator for multiplication and the right input matrix (multiplicand) provided to the accelerator may be swapped. That is, the left input matrix in the host may be loaded into the memory locations assigned to the right matrix in the accelerator, and the right input matrix in the host may be loaded into the memory locations assigned to the left matrix in the accelerator. Furthermore, if the format required by the accelerator within each fractal is the same as the order of the matrices in the host (in 2D format), each fractal of the input matrix is transposed before (or simultaneously with) moving the corresponding data from host memory to accelerator memory; thus, the input is transposed. Otherwise, such transposition is not required. For example, if the accelerator requires a row-major format within the fractal (i.e., zZ or zN format) and the host provides a row-major format, the fractals are transposed. After the multiplication by the accelerator, the output is loaded back into host memory using a DMA operation, for example as described in connection with fig. 8.
Fig. 9A and 9B illustrate transpose operations performed on portions of the input matrices to achieve a corresponding transpose of the output matrix. Fig. 9A shows a row of fractals 915 of matrix A acting as the left matrix and a column of fractals 920 of matrix B acting as the right matrix, which, when multiplied by matrix multiplication (performed by the accelerator), produce the fractal of shape 925. Shape 925 does not achieve the desired transpose; indeed, fig. 9A is provided only for reference and does not implement the present embodiment. Fig. 9B shows a row of fractals 940 (having the same elements as the column stripe 920) of matrix B used as the left matrix and a column of fractals 935 (having the same elements as the row stripe 915) of matrix A used as the right matrix, which, when multiplied by the matrix multiplication B^T × A^T, produce the fractal of shape 945. The row of fractals 940 is a transpose of the column of fractals 920, and the column of fractals 935 is a transpose of the row of fractals 915. It can be verified that result 945 is the transpose of result 925, even though no transpose is explicitly performed on the output matrix or its fractals. For example, in the output fractal 925, the upper-right element c34 is derived from a31×b14 + a32×b24 + ... + a38×b84. Similarly, in the output fractal 945, the lower-left element c34 is derived from the equivalent sum b14×a31 + b24×a32 + ... + b84×a38. The movement of the output fractal 945 from accelerator memory to host memory may be performed in the same manner as described in connection with fig. 8.
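The swap relies on the identity (A×B)^T = B^T × A^T; a short numerical check follows (NumPy, illustrative only, a stand-in for the accelerator data path rather than the data path itself; the stripe sizes are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    A_row = rng.standard_normal((16, 128))     # one row of fractals taken from A
    B_col = rng.standard_normal((128, 16))     # one column of fractals taken from B

    direct = A_row @ B_col                     # fractal 925 of fig. 9A
    swapped = B_col.T @ A_row.T                # fractal 945 of fig. 9B

    assert np.allclose(swapped, direct.T)      # 945 is the transpose of 925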
FIG. 10 generally illustrates interaction of a host and an accelerator (e.g., GEMM) provided in one embodiment of the invention. The matrix data is moved from the host 1010 to the accelerator 1020 by a first DMA operation or set of operations 1015. The accelerator performs the matrix multiplication and the result is moved back from accelerator 1020 to host 1010 by second DMA operation 1025. Notably, it is not necessary to perform a separate matrix format conversion operation in the host or accelerator.
FIG. 11 illustrates a computing device provided by an embodiment of the invention. The apparatus includes a host portion 1110, the host portion 1110 including at least a processor 1120 and a memory 1130. The apparatus also includes an accelerator 1140. The apparatus also includes a DMA manager 1150. The DMA manager 1150 may be a stand-alone device or may be provided by the processor 1120 executing program instructions stored in the memory 1130. The memory 1130 may typically include volatile and non-volatile memory such as RAM, ROM, mass storage, and the like. Those skilled in the art will readily understand the different portions of memory and their typical usage. A host matrix memory 1135, in which matrix data accessible to the host portion 1110 is stored, is separate from or included in the memory 1130. An accelerator matrix memory 1145, in which matrix data accessible to the accelerator is stored, is likewise separate from or included in the memory 1130. The apparatus also includes at least one data bus 1160 operable to couple the various components of the apparatus; data can be moved between the components via the data bus 1160. At least one such data bus couples the memory 1135 to the memory 1145 so that data may be moved from one of these memories to the other. The DMA manager 1150 controls various aspects of such data movement, particularly with respect to matrix data, as described herein. In particular, the DMA manager 1150, or the processor 1120 executing program instructions stored in the memory 1130, may be used to configure and control DMA data movement between the host matrix memory 1135 and the accelerator matrix memory 1145, as described herein. The computing device of fig. 11 may be varied in many ways and may include additional components. A direct data connection 1165 between the accelerator 1140 and the accelerator matrix memory 1145 is also shown, to emphasize that the accelerator may be architecturally close to the accelerator matrix memory 1145 in order to facilitate matrix multiplication computations.
The matrix multiplication provided by one embodiment is exemplified below. In this example, each fractal 1205 is a 16 x 16 sub-matrix, the first matrix A 1210 (the multiplier) is a 128-row x 256-column matrix, the second matrix B (the multiplicand) is a 256-row x 512-column matrix, and the third matrix C 1230 (the result of the multiplication) is a 128-row x 512-column matrix. As shown, the left input matrix X of the accelerator is in zZ format, the right input matrix Y of the accelerator is in nZ format, and the output matrix Z of the accelerator is in zN format. These are the formats required by the accelerator (for the inputs) and provided by the accelerator (for the output). In addition, as stored in host memory, matrices A, B and C are in column-major format (not shown).
In order to calculate the result C = A x B using the hardware accelerator, the following operations are performed. To calculate each fractal of the output matrix C, one row of fractals of matrix A and one column of fractals of matrix B are used (see fig. 5).
In a first step, a check is performed to determine whether the format within the fractals of the output matrix Z provided by the accelerator and the format within the fractals of the output matrix C required in host memory are the same. In this case, the output matrix C in host memory is in column-major format, but the output matrix Z provided by the accelerator is row-major within the fractal (the lowercase "z" in "zN"). Therefore, a transposition is required. Using the technique shown in fig. 9, the fractal transposition is not performed explicitly at the output. That is, matrix A is loaded into the right matrix (multiplicand) Y position in the hardware accelerator, matrix B is loaded into the left matrix (multiplier) X position in the hardware accelerator, and the order of the two matrices is thereby swapped.
Next, as shown in fig. 6, a column of fractals of matrix B is moved (loaded) to the left matrix position (X position) in the hardware accelerator according to the procedure described in connection with fig. 6. More specifically, the DMA operation that moves the nth column of fractals of matrix B to the left matrix (X) position in the accelerator is performed as a set of block_n DMA instructions, where block_n is the number of fractals in one stripe. The ith DMA instruction may include the following parameters. The value of src is the starting address of the ith fractal of the nth column of fractals of matrix B. The value of src may be n × 256 × 16 (the starting address of the nth column of fractals) plus i × 16 (the offset of the ith fractal within that column). The value of dst is the starting address of the ith fractal of matrix X in accelerator memory. The value of dst may be the starting address of the accelerator's left matrix X buffer (hardware-defined) plus i × 16 × 16. The value of len is equal to 16, the side length of the fractal. The value of repeat is also equal to 16, the side length of the fractal. The value of src_stride is set to 256 - 16 = 240: since matrix B is column-major, the leading dimension is the number of rows (elements per column), i.e., 256, and 16 is the side length of the fractal. The value of dst_stride is set to 0.
Next, a check is performed to compare the format required for the fractals in the left matrix X in the accelerator with the format of matrix B as provided by the host. The accelerator requires that the left matrix X be in zZ format (hence the format inside each fractal is row-major), and the host-provided matrix B is in column-major format. Thus, the fractals in matrix B are not transposed when they are moved to the accelerator.
Next, one row of fractals in matrix A is moved (loaded) to the right matrix position (Y position) in the hardware accelerator according to the procedure described in connection with FIG. 7. More specifically, to move a row of fractals (e.g. row m) in matrix A to the right (Y) matrix position of the accelerator, a single DMA instruction is used. The DMA instruction may include the following parameters. The value of src is the starting address, in host memory, of the stripe of matrix A being moved (i.e. the mth row of fractals); it may be m×16. The value of dst is the starting address of the stripe in the right matrix Y in the accelerator memory; it may be the starting address of the right (Y) matrix buffer (hardware defined). The value of len is equal to 16. The value of repeat is equal to 256, the number of elements in a row of matrix A. The value of src_stride is set to 128 - 16 = 112: since A is column-major, the leading dimension is the number of rows (elements in a column), i.e. 128, and 16 is the length of the fractal. The value of dst_stride is set to 0.
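A corresponding sketch for this single-instruction transfer, again using the hypothetical dma_issue() helper, an assumed Y_BUFFER_BASE for the hardware-defined right-matrix buffer, and element-granular addresses:

```c
#include <stddef.h>

extern void dma_issue(size_t src, size_t dst, size_t len, size_t repeat,
                      size_t src_stride, size_t dst_stride);
extern const size_t Y_BUFFER_BASE;   /* hardware-defined right-matrix (Y) buffer */

/* Move the mth row of fractals of A (128 x 256, column-major in host memory)
 * to the accelerator's right matrix (Y) position with one DMA instruction,
 * using the parameter values given above. */
void load_a_row_of_fractals(size_t m)
{
    dma_issue(m * 16,        /* src: start of the mth row of fractals     */
              Y_BUFFER_BASE, /* dst                                       */
              16,            /* len                                       */
              256,           /* repeat: number of columns of A            */
              128 - 16,      /* src_stride = 112 (leading dimension 128)  */
              0);            /* dst_stride                                */
}
```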
Next, a check is performed to compare the format required for the fractals in the right (Y) matrix in the accelerator with the format of matrix A as provided by the host. The accelerator requires that the right matrix be in nZ format (hence the format inside each fractal is column-major), and the host-provided matrix A is in column-major format. Thus, when the fractals in matrix A are moved to the accelerator, they are transposed. This may be performed by a DMA instruction that includes the following parameters. The value of src is the starting address of the fractals before transposition. The value of dst is the starting address of the transposed fractals. The value of N is the number of fractals to move (256/16 = 16), i.e. 16 fractals per row of fractals in A. A boolean flag in the DMA instruction is set to true to indicate that the fractals are to be transposed while being moved. Alternatively, the fractals may be transposed as part of another DMA operation being executed.
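A minimal sketch of what such a fractal-transposing DMA descriptor could look like; the struct and its field names are assumptions for illustration only, not definitions from the patent:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor for a DMA that transposes 16 x 16 fractals in flight. */
struct dma_fractal_transpose_desc {
    size_t src;        /* starting address of the fractals before transposition */
    size_t dst;        /* starting address of the transposed fractals           */
    size_t n;          /* number of fractals to move, here 256 / 16 = 16        */
    bool   transpose;  /* true: transpose each fractal while moving it          */
};
```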
Next, using the mth row of fractals of matrix A and the nth column of fractals of matrix B loaded into the accelerator memory, the accelerator performs a matrix multiplication calculation to compute the fractal at the mth row and nth column of the output (result) matrix Z, as provided by the accelerator. The matrix Z is treated here as a block matrix, so the mth row and the nth column refer to the mth block row and the nth block column.
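As a point of reference only, the arithmetic performed for one output fractal can be written as the following software sketch; it describes the mathematics, not the accelerator's datapath, and the in-fractal orderings match the zZ, nZ and zN formats described above.

```c
/* Reference computation of one 16 x 16 output fractal of Z: accumulate the
 * products of the 16 fractal pairs along the shared dimension (256 / 16 = 16).
 * X fractals are row-major inside (zZ), Y fractals are column-major inside (nZ),
 * and the output fractal is row-major inside (zN). */
void fractal_row_times_column(const float x[16][16 * 16],
                              const float y[16][16 * 16],
                              float z[16 * 16])
{
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++) {
            float acc = 0.0f;
            for (int f = 0; f < 16; f++) {      /* fractal index along the shared dim */
                for (int k = 0; k < 16; k++) {
                    acc += x[f][i * 16 + k]     /* element (i, k), row-major           */
                         * y[f][j * 16 + k];    /* element (k, j), column-major        */
                }
            }
            z[i * 16 + j] = acc;                /* element (i, j), row-major           */
        }
    }
}
```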
Next, the resulting fractal in Z (or fractals, if more than one is provided at a time) is moved back from the accelerator to host memory using a single DMA operation, generally as described in connection with FIG. 8. More specifically, this may be issued as a single DMA instruction with the following parameters. The value of src is the starting address, in the accelerator memory (hardware defined), of the fractal of the output matrix Z being moved. The value of dst is the starting address of the corresponding fractal in output matrix C (i.e. the fractal at row m and column n) as it is to appear in host memory; it may be n×128×16 plus m×16×16. Since C (in the host) is column-major, the offset of the nth fractal column is computed first. The value of len is equal to 16. The value of repeat is equal to 16. The value of src_stride is set to 0. The value of dst_stride is set to 128 - 16 = 112: since C (in the host) is column-major, the leading dimension is the number of rows (elements in a column), i.e. 128, and 16 is the length of the fractal.
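A final sketch of this output-side DMA instruction, with the same hypothetical dma_issue() helper, an assumed Z_BUFFER_BASE for the hardware-defined output buffer, and element-granular addresses:

```c
#include <stddef.h>

extern void dma_issue(size_t src, size_t dst, size_t len, size_t repeat,
                      size_t src_stride, size_t dst_stride);
extern const size_t Z_BUFFER_BASE;   /* hardware-defined output (Z) buffer */

/* Move the result fractal (m, n) from the accelerator's output buffer back to
 * matrix C (128 x 512, column-major) in host memory, using the parameter
 * values given above. */
void store_c_fractal(size_t m, size_t n)
{
    dma_issue(Z_BUFFER_BASE,              /* src                  */
              n * 128 * 16 + m * 16 * 16, /* dst, as stated above */
              16,                         /* len                  */
              16,                         /* repeat               */
              0,                          /* src_stride           */
              128 - 16);                  /* dst_stride = 112     */
}
```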
It should be noted that no explicit transposition needs to be performed on the output of the accelerator; the transposition has already been achieved by the operations performed when moving the data to the accelerator. The above process is repeated for different rows and columns of fractals as needed to provide the different portions of the final output matrix C stored in the host.
It will be appreciated that although specific embodiments of the technology have been described herein for purposes of illustration, various modifications may be made without deviating from the scope of the technology. Accordingly, the specification and drawings are to be regarded only as illustrative of the invention as defined in the appended claims, and are intended to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the invention. In particular, it is within the scope of the present technology to provide a computer program product or program element, or a program storage or memory device such as a magnetic or optical wire, tape or disc, for storing machine-readable signals, for controlling the operation of a computer according to the methods of the present technology and/or for constructing some or all of its components according to the systems of the present technology.
Acts associated with the methods described herein may be implemented in a computer program product as encoded instructions. In other words, the computer program product is a computer readable medium on which the software code is recorded to execute the method when the computer program product is loaded into the memory and executed on the microprocessor of the wireless communication device.
Acts associated with the methods described herein may be implemented as coded instructions in a number of computer program products. For example, a first portion of the method may be performed by one computing device and a second portion of the method may be performed by another computing device, server, or the like. In this case, each computer program product is a computer readable medium on which software code is recorded to perform the appropriate portion of the method when the product is loaded into memory and executed on a microprocessor of a computing device.
Further, each operation of the method may be performed on any computing device, such as a personal computer, server, PDA, or the like, and in accordance with, or as part of, one or more program elements, modules, or objects generated from any programming language, such as C++, Java, or the like. Furthermore, each operation, or a file or object or the like implementing each of the operations, may be performed by dedicated hardware or a circuit module designed for that purpose.
While the invention has been described with reference to specific features and embodiments thereof, it will be apparent that various modifications and combinations of the invention can be made without departing from the invention. Accordingly, the specification and drawings are to be regarded only as illustrative of the invention as defined in the appended claims, and are intended to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the invention.

Claims (24)

1. A method, the method comprising:
moving matrix data from a source memory to a target memory using one or more data movement operations, wherein the data movement operations are used for rearranging the matrix data from a first format to a second format, the first format is a format in which the matrix data is stored in the source memory, and the second format is a format in which the matrix data is required to be stored in the target memory.
2. The method of claim 1, wherein the source memory is memory in a host and the target memory is memory in a hardware accelerator for performing matrix multiplication, the method further comprising: the hardware accelerator device performs one or more matrix multiplication operations on the matrix data in the target memory to produce output matrix data.
3. The method according to claim 2, wherein the method further comprises: the output matrix data is moved from memory in the hardware accelerator device to memory in the host computing device using one or more second data movement operations.
4. A method according to claim 3, wherein the second data-moving operation is for rearranging the output matrix data from a third format, which is a format in which the matrix data is provided by the hardware accelerator device, to a fourth format, which is a format in which the output matrix data is required to be stored in a memory in the host computing device.
5. The method of claim 3, wherein the first data movement operation and the second data movement operation are used in common to reorder the output matrix data from a third format to a fourth format, wherein the third format is a format in which the matrix data is provided by the hardware accelerator device, and the fourth format is a format in which the output matrix data is required to be stored in memory in the host computing device.
6. The method of claim 5, wherein:
the moving the matrix data from the source memory to the target memory includes: moving data of a left matrix and moving data of a right matrix, wherein the left matrix is multiplied by the right matrix;
the first data movement operation and the second data movement operation being used in common to reorder the output matrix data from a third format to a fourth format comprises: exchanging the data of the left matrix with the data of the right matrix to obtain a transpose of the output matrix.
7. An apparatus, the apparatus comprising:
a host matrix memory for storing matrix data in a first format;
An accelerator matrix memory for storing the matrix data in a second format;
hardware accelerator means for performing a matrix multiplication based on the matrix data stored in the accelerator matrix memory;
a direct memory access manager for moving the matrix data from the host matrix memory to the accelerator matrix memory using one or more data movement operations, wherein the data movement operations are used to reorder the matrix data from a first format into a second format, the first format being a format in which the matrix data is stored in the host matrix memory, the second format being a format in which the matrix data is required to be stored in the accelerator matrix memory.
8. The apparatus of claim 7, wherein the hardware accelerator device is to: performing one or more matrix multiplication operations on the matrix data in the accelerator matrix memory to produce output matrix data; and storing the output matrix data in the accelerator matrix memory.
9. The apparatus of claim 8, wherein the direct memory access manager is to move the output matrix data from the accelerator matrix memory to the host matrix memory using one or more second data move operations.
10. The apparatus of claim 9, wherein the second data movement operation is to reorder the output matrix data from a third format to a fourth format, wherein the third format is a format in which the matrix data is provided to the accelerator matrix memory by the hardware accelerator device, and the fourth format is a format in which the output matrix data is required to be stored in the host matrix memory.
11. The apparatus of claim 9, wherein the first data movement operation and the second data movement operation are used in common to reorder the output matrix data from a third format to a fourth format, wherein the third format is a format in which the matrix data is provided to the accelerator matrix memory by the hardware accelerator device, and the fourth format is a format in which the output matrix data is required to be stored in the host matrix memory.
12. The apparatus of claim 11, wherein:
the moving matrix data from the host matrix memory to the accelerator matrix memory includes: moving data of a left matrix and moving data of a right matrix, wherein the left matrix is multiplied by the right matrix;
the first data movement operation and the second data movement operation being used in common to reorder the output matrix data from a third format to a fourth format comprises: exchanging the data of the left matrix with the data of the right matrix to obtain a transpose of the output matrix.
13. A method, the method comprising:
receiving an instruction for a hardware accelerator to perform a matrix multiplication operation using, as input, input matrix data stored in a first format in a host memory;
determining that the output of the matrix multiplication operation is stored in the host memory in a second format;
moving the input matrix data stored in the host memory to a hardware accelerator memory;
reformatting the input matrix data from the first format to a third format, wherein the third format is a four-dimensional (4D) format required by the hardware accelerator memory to store the input matrix data for the matrix multiplication operation;
the hardware accelerator performs the matrix multiplication operation on the input matrix data reformatted into the third format stored in the hardware accelerator memory to generate result matrix data in a fourth format, wherein the fourth format is a 4D format required by the hardware accelerator memory to store the result matrix data;
Reformatting the result matrix data into an order compatible with the second format;
and moving the reformatted result matrix data from the hardware accelerator memory to the host memory for storage as an output of the matrix multiplication operation.
14. The method of claim 13, wherein the first format designates an order of data as row-major or column-major and the second format designates an order of data as row-major or column-major.
15. The method of claim 13, wherein the second format is different from the first format.
16. The method of claim 13, wherein reformatting the input matrix data from the first format to a third format comprises: arranging the input matrix data into one or more fractals, wherein the third format designates the order of the input matrix data within the fractals and the order between the fractals according to the format required by the memory of the hardware accelerator.
17. The method of claim 16, wherein the third format designates an order of input matrix data within the fractals as row-major or column-major and an order between the fractals as row-major or column-major.
18. The method of claim 16, wherein reformatting the input matrix data from the first format to a third format further comprises: the input matrix data is transposed within at least some of the fractals according to a format required for the input matrix data to be stored in the hardware accelerator memory.
19. The method of claim 16, wherein reformatting the input matrix data from the first format to a third format comprises:
determining left and right matrix data to multiply as part of the matrix multiplication operation;
determining that result matrix data of the fourth format is obtained by multiplying the left matrix data with the right matrix data, and that data within a fractal of the result matrix data of the fourth format is a transpose of data within a fractal of the result matrix data of the second format;
exchanging the left side matrix data with the right side matrix data for the matrix multiplication operation to avoid having to transpose the input matrix data within a fractal of the result matrix data;
the reformatting the result matrix data into an order compatible with the second format includes: retaining the result matrix data.
20. The method of claim 19, wherein said exchanging said left matrix data with said right matrix data comprises: transferring data in at least one of the left matrix data and the right matrix data.
21. The method of claim 13, wherein reformatting the result matrix data from the fourth format to the second format further comprises: at least some of the result matrix data is transposed.
22. The method of claim 13, wherein the instruction to perform a matrix multiplication operation is received from an application, the method further comprising: the application defines the first format and defines the second format independently of the first format.
23. A method, the method comprising:
receiving an instruction for a hardware accelerator to perform a matrix multiplication operation using, as input, input matrix data stored in a first format in a host memory;
determining that the output of the matrix multiplication operation is stored in the host memory in a second format;
moving the input matrix data stored in the host memory to a hardware accelerator memory;
Determining that a format of a resulting matrix generated by multiplying left matrix data and right matrix data as part of the matrix multiplication operation corresponds to a transpose of the second format;
reformatting the matrix data from the first format to a third format by exchanging the left matrix data with the right matrix data for the matrix multiplication operation;
the hardware accelerator performing the matrix multiplication operation on the input matrix data reformatted into the third format stored in the hardware accelerator memory to produce resultant matrix data corresponding to the second format;
and moving the result matrix data from the hardware accelerator memory to the host memory to be stored as the output of the matrix multiplication operation.
24. The method of claim 23, wherein said exchanging the left matrix data with the right matrix data comprises: data transposes within at least one of the left side matrix data and the right side matrix data are suppressed to obtain an optimized third format.
CN202280042065.5A 2021-09-22 2022-09-22 Method and apparatus for matrix computation acceleration Pending CN117501250A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202163247029P 2021-09-22 2021-09-22
US63/247,029 2021-09-22
PCT/CN2022/120452 WO2023046001A1 (en) 2021-09-22 2022-09-22 Method and apparatus for matrix computation acceleration

Publications (1)

Publication Number Publication Date
CN117501250A true CN117501250A (en) 2024-02-02

Family

ID=85720079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280042065.5A Pending CN117501250A (en) 2021-09-22 2022-09-22 Method and apparatus for matrix computation acceleration

Country Status (2)

Country Link
CN (1) CN117501250A (en)
WO (1) WO2023046001A1 (en)


Also Published As

Publication number Publication date
WO2023046001A1 (en) 2023-03-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination