WO2023236013A1 - Data re-arrangement by mixed SIMT and SIMD execution mode - Google Patents

Data re-arrangement by mixed SIMT and SIMD execution mode

Info

Publication number
WO2023236013A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
transpose
memory
coalescing
dimension
Application number
PCT/CN2022/097200
Other languages
French (fr)
Inventor
Peng Zhao
Wei Zhu
Xiaodong Lin
Original Assignee
Intel Corporation
Application filed by Intel Corporation
Priority to PCT/CN2022/097200
Publication of WO2023236013A1


Classifications

    • All classifications below fall under G (Physics) > G06 (Computing; calculating or counting) > G06F (Electric digital data processing) > G06F 9/00 (Arrangements for program control, e.g. control units) > G06F 9/06 (using stored programs, i.e. using an internal store of processing equipment to receive or retain programs) > G06F 9/30 (Arrangements for executing machine instructions, e.g. instruction decode)
    • G06F 9/30043: LOAD or STORE instructions; Clear instruction
    • G06F 9/30032: Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G06F 9/30036: Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F 9/3851: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution, from multiple instruction streams, e.g. multistreaming
    • G06F 9/3888: Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel

Definitions

  • a re-arrangement operation on N-D data may be identified as a combination of a 2D data transpose operation or a data movement operation on the N-D data.
  • the data movement operation can be simply realized by memory coalescing load and store.
  • the 2D data transpose operation can be realized by a combination of SIMT memory coalescing load and SIMD memory coalescing store or a combination of SIMD memory coalescing load and SIMT memory coalescing store.
  • the re-arrangement operation on the N-D data can be completed by memory coalescing load and store, and thus the performance of the proposed data re-arrangement approach may be much better than the conventional direct load/store approach.
  • For the 3D re-arrangement operations shown in FIG. 9 in which only a data movement is needed, the performance of the proposed approach is the same as that of the conventional direct load-store approach, i.e. the speedup is 1.0; for the other 3D re-arrangement operations, the speedup is from 1.9x to 7.7x.
  • For the 4D re-arrangement operations shown in the table of FIG. 10, each re-arrangement operation can be identified as a combination of a 2D transpose operation and a data movement operation, so the overall performance may depend on the performance of both the transpose operation and the movement operation. From the table, it can be seen that the overall performance of the proposed approach for the re-arrangement operations 12 to 23 is also much better than that of the conventional direct load-store approach, and the highest speedup is up to 8.7x.
  • FIG. 11 illustrates an example flowchart of a N-D data re-arrangement procedure according to some embodiments of the disclosure.
  • the N-D data re-arrangement procedure may be implemented by a processor circuitry and may include operations 1110 to 1130.
  • Under a condition that a data coalescing dimension of the N-D data does not change in the re-arrangement operation, the combination of the 2D data transpose operation or the data movement operation may include only the data movement operation.
  • Under a condition that the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) and the target dimension permutation after the re-arrangement operation is (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation may include only the 2D data transpose operation, where N is greater than 2 and I is an integer between 1 and N-1.
  • the 2D matrix may be a 2D matrix of (d0×…×dI-1) by (dI×…×dN-1) corresponding to the N-D data.
  • the processor circuitry may split the 2D matrix into one or more basic blocks along a memory coalescing direction.
  • a size of the basic block may be determined based on a vector length supported by the processor circuitry.
  • the processor circuitry may perform a transpose of each basic block based on the combination of SIMT memory coalescing load and SIMD memory coalescing store or the combination of SIMD memory coalescing load and SIMT memory coalescing store.
  • performing the transpose of each basic block may include: loading all elements of the basic block with a plurality of threads from a global memory based on the SIMT memory coalescing load, and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMD memory coalescing store.
  • the number of the threads may be determined based on the vector length.
  • performing the transpose of each basic block may include: loading all elements of the basic block with a plurality of threads from a global memory based on the SIMD memory coalescing load, and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMT memory coalescing store. Also, the number of the threads may be determined based on the vector length.
  • the processor circuitry may switch transposed basic blocks present in a diagonal direction of the 2D matrix according to a transpose direction of the 2D transpose operation.
  • FIG. 13 is a block diagram illustrating components, according to some example embodiments, able to read instructions from a machine-readable or computer-readable medium (e.g., a non-transitory machine-readable storage medium) and perform any one or more of the methodologies discussed herein.
  • FIG. 13 shows a diagrammatic representation of hardware resources 1300 including one or more processors (or processor cores) 1310, one or more memory/storage devices 1320, and one or more communication resources 1330, each of which may be communicatively coupled via a bus 1340.
  • For embodiments where node virtualization (e.g., NFV) is utilized, a hypervisor 1302 may be executed to provide an execution environment for one or more network slices/sub-slices to utilize the hardware resources 1300.
  • the processors 1310 may include, for example, a processor 1312 and a processor 1314, each of which may be, e.g., a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a visual processing unit (VPU), a field programmable gate array (FPGA), or any suitable combination thereof.
  • the communication resources 1330 may include interconnection or network interface components or other suitable devices to communicate with one or more peripheral devices 1304 or one or more databases 1306 via a network 1308.
  • the communication resources 1330 may include wired communication components (e.g., for coupling via a Universal Serial Bus (USB)), cellular communication components, NFC components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components.
  • Instructions 1350 may comprise software, a program, an application, an applet, an app, or other executable code for causing at least any of the processors 1310 to perform any one or more of the methodologies discussed herein.
  • the instructions 1350 may reside, completely or partially, within at least one of the processors 1310 (e.g., within the processor’s cache memory) , the memory/storage devices 1320, or any suitable combination thereof.
  • any portion of the instructions 1350 may be transferred to the hardware resources 1300 from any combination of the peripheral devices 1304 or the databases 1306. Accordingly, the memory of processors 1310, the memory/storage devices 1320, the peripheral devices 1304, and the databases 1306 are examples of computer-readable and machine-readable media.
  • the processor platform 1400 of the illustrated example includes a processor 1412.
  • the processor 1412 of the illustrated example is hardware.
  • the processor 1412 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer.
  • the hardware processor may be a semiconductor based (e.g., silicon based) device.
  • the processor implements one or more of the methods or processes described above.
  • the processor 1412 of the illustrated example includes a local memory 1413 (e.g., a cache) .
  • the processor 1412 of the illustrated example is in communication with a main memory including a volatile memory 1414 and a non-volatile memory 1416 via a bus 1418.
  • the volatile memory 1414 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), and/or any other type of random access memory device.
  • the non-volatile memory 1416 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1414, 1416 is controlled by a memory controller.
  • the processor platform 1400 of the illustrated example also includes interface circuitry 1420.
  • the interface circuitry 1420 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), a Bluetooth® interface, a near field communication (NFC) interface, and/or a PCI express interface.
  • one or more input devices 1422 are connected to the interface circuitry 1420.
  • the input device(s) 1422 permit(s) a user to enter data and/or commands into the processor 1412.
  • the input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, and/or a voice recognition system.
  • One or more output devices 1424 are also connected to the interface circuitry 1420 of the illustrated example.
  • the output devices 1424 can be implemented, for example, by display devices (e.g., a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-plane switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer and/or a speaker.
  • the interface circuitry 1420 of the illustrated example thus typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.
  • the interface circuitry 1420 may include a training dataset inputted through the input device (s) 1422 or retrieved from the network 1426.
  • the processor platform 1400 of the illustrated example also includes one or more mass storage devices 1428 for storing software and/or data.
  • Examples of such mass storage devices 1428 include floppy disk drives, hard disk drives, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.
  • Machine executable instructions 1432 may be stored in the mass storage device 1428, in the volatile memory 1414, in the non-volatile memory 1416, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.
  • Example 2 includes the apparatus of Example 1, wherein the processor circuitry is configured to perform the 2D data transpose operation by: determining a 2D matrix corresponding to the N-D data based on an original dimension permutation and a target dimension permutation associated with the re-arrangement operation; and performing the 2D data transpose operation on the 2D matrix.
  • Example 3 includes the apparatus of Example 2, wherein performing the 2D data transpose operation on the 2D matrix comprises: splitting the 2D matrix into one or more basic blocks along a memory coalescing direction; performing a transpose of each basic block based on the combination of SIMT memory coalescing load and SIMD memory coalescing store or the combination of SIMD memory coalescing load and SIMT memory coalescing store; and switching transposed basic blocks present in a diagonal direction of the 2D matrix according to a transpose direction of the 2D transpose operation.
  • Example 4 includes the apparatus of Example 3, wherein a size of the basic block is determined based on a vector length supported by the processor circuitry.
  • Example 5 includes the apparatus of Example 4, wherein performing the transpose of each basic block comprises: loading all elements of the basic block with a plurality of threads from a global memory based on the SIMT memory coalescing load; and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMD memory coalescing store, wherein a number of the threads is determined based on the vector length.
  • Example 10 includes the apparatus of Example 9, wherein the 2D matrix is a 2D matrix of (d0×…×dI-1) by (dI×…×dN-1) corresponding to the N-D data.
  • Example 11 includes the apparatus of any of Examples 2 to 6, wherein under a condition that a data coalescing dimension of the N-D data changes in the re-arrangement operation and the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) but the target dimension permutation of the N-D data after the re-arrangement operation is not (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises both the 2D data transpose operation and the data movement operation, where N is greater than 2 and I is an integer between 1 and N-1.
  • Example 12 includes the apparatus of Example 11, wherein the 2D matrix is a 2D matrix of (d0×…×dM) by (dM+1×…×dN-1) corresponding to the N-D data, where dM is a data coalescing dimension in the target dimension permutation of the N-D data, and M is an integer between 0 and N-2.
  • Example 15 includes the method of Example 14, wherein performing the 2D data transpose operation comprises: determining a 2D matrix corresponding to the N-D data based on an original dimension permutation and a target dimension permutation associated with the re-arrangement operation; and performing the 2D data transpose operation on the 2D matrix.
  • Example 16 includes the method of Example 15, wherein performing the 2D data transpose operation on the 2D matrix comprises: splitting the 2D matrix into one or more basic blocks along a memory coalescing direction; performing a transpose of each basic block based on the combination of SIMT memory coalescing load and SIMD memory coalescing store or the combination of SIMD memory coalescing load and SIMT memory coalescing store; and switching transposed basic blocks present in a diagonal direction of the 2D matrix according to a transpose direction of the 2D transpose operation.
  • Example 17 includes the method of Example 16, wherein a size of the basic block is determined based on a vector length supported by a processor circuitry for implementing the method.
  • Example 18 includes the method of Example 17, wherein performing the transpose of each basic block comprises: loading all elements of the basic block with a plurality of threads from a global memory based on the SIMT memory coalescing load; and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMD memory coalescing store, wherein a number of the threads is determined based on the vector length.
  • Example 20 includes the method of any of Examples 14 to 19, wherein under a condition that a data coalescing dimension of the N-D data does not change in the re-arrangement operation, the combination of the 2D data transpose operation or the data movement operation comprises only the data movement operation.
  • Example 21 includes the method of any of Examples 14 to 19, wherein under a condition that N is equal to 2, the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation.
  • Example 22 includes the method of any of Examples 15 to 19, wherein under a condition that the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) and the target dimension permutation of the N-D data after the re-arrangement operation is (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation, where N is greater than 2 and I is an integer between 1 and N-1.
  • Example 23 includes the method of Example 22, wherein the 2D matrix is a 2D matrix of (d0×…×dI-1) by (dI×…×dN-1) corresponding to the N-D data.
  • Example 24 includes the method of any of Examples 15 to 19, wherein under a condition that a data coalescing dimension of the N-D data changes in the re-arrangement operation and the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) but the target dimension permutation of the N-D data after the re-arrangement operation is not (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises both the 2D data transpose operation and the data movement operation, where N is greater than 2 and I is an integer between 1 and N-1.
  • Example 25 includes the method of Example 24, wherein the 2D matrix is a 2D matrix of (d0×…×dM) by (dM+1×…×dN-1) corresponding to the N-D data, where dM is a data coalescing dimension in the target dimension permutation of the N-D data, and M is an integer between 0 and N-2.
  • Example 26 includes the method of any of Examples 14 to 25, wherein a data coalescing dimension in a dimension permutation (d0, d1, …, dN-1) of the N-D data is a last dimension dN-1 in the dimension permutation under a condition that the N-D data is stored in a row-major order and is a first dimension d0 in the dimension permutation under a condition that the N-D data is stored in a column-major order.
  • Example 27 includes a computer-readable medium having instructions stored thereon, wherein the instructions, when executed by processor circuitry, cause the processor circuitry to perform the method of any of Examples 14 to 26.
  • Example 28 includes a device for data re-arrangement, comprising means for performing the method of any of Examples 14 to 26.
  • Various techniques, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, non-transitory computer readable storage medium, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the various techniques.
  • the non-transitory computer readable storage medium may be a computer readable storage medium that does not include a signal.
  • the computing system may include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements) , at least one input device, and at least one output device.
  • the volatile and non-volatile memory and/or storage elements may be a RAM, EPROM, flash drive, optical drive, magnetic hard drive, solid state drive, or other medium for storing electronic data.
  • One or more programs that may implement or utilize the various techniques described herein may use an application programming interface (API) , reusable controls, and the like. Such programs may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program (s) may be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.
  • Exemplary systems or devices may include without limitation, laptop computers, tablet computers, desktop computers, smart phones, computer terminals and servers, storage databases, and other electronics which utilize circuitry and programmable memory, such as household appliances, smart televisions, digital video disc (DVD) players, heating, ventilating, and air conditioning (HVAC) controllers, light switches, and the like.
  • the terms “a” or “an” are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of “at least one” or “one or more. ”
  • the term “or” is used to refer to a nonexclusive or, such that “A or B” includes “A but not B, ” “B but not A, ” and “A and B, ” unless otherwise indicated.
  • the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.”

Abstract

The application relates to data re-arrangement and provides a method for data re-arrangement. The method may include: identifying a re-arrangement operation on N-D data as a combination of a 2D data transpose operation or a data movement operation on the N-D data, where N is an integer greater than 1; performing, under a condition that the re-arrangement operation comprises the 2D data transpose operation, the 2D data transpose operation based on a combination of SIMT memory coalescing load and SIMD memory coalescing store or a combination of SIMD memory coalescing load and SIMT memory coalescing store; and performing, under a condition that the re-arrangement operation comprises the data movement operation, the data movement operation based on direct memory coalescing load and store.

Description

DATA RE-ARRANGEMENT BY MIXED SIMT AND SIMD EXECUTION MODE

TECHNICAL FIELD
Embodiments described herein generally relate to data processing, and more particularly relate to data re-arrangement by a mixed single instruction multiple threads (SIMT) and single instruction multiple data (SIMD) execution mode.
BACKGROUND
Data re-arrangement is a very important and widely used function in high performance computing, machine learning and deep learning areas. Typical use cases of data re-arrangement may include matrix transpose in numerical computation, numpy transpose in machine learning and tensor permutation in deep learning.
BRIEF DESCRIPTION OF THE DRAWINGS
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
FIG. 1 illustrates an example direct transpose approach for a 4×4 matrix based on SIMT data loading and storing;
FIG. 2 illustrates an example improved transpose approach for a 4×4 matrix based on a shared local memory (SLM) ;
FIG. 3 illustrates an example transpose approach for a 4×4 matrix based on a combination of SIMT memory coalescing load and SIMD memory coalescing store according to some embodiments of the disclosure;
FIG. 4 illustrates an example transpose approach for an N×N matrix based on a combination of SIMT memory coalescing load and SIMD memory coalescing store according to some embodiments of the disclosure;
FIG. 5 illustrates a histogram showing performance comparison of different transpose approaches for two dimension (2D) matrix transpose;
FIG. 6 illustrates an example three dimension (3D) data re-arrangement operation according to some embodiments of the disclosure;
FIG. 7 illustrates an example 3D data re-arrangement operation according to some embodiments of the disclosure;
FIG. 8 illustrates an example 3D data re-arrangement operation according to some embodiments of the disclosure;
FIG. 9 illustrates a histogram showing 3D data re-arrangement performance comparison of a conventional direct load-store approach and a proposed data re-arrangement approach according to some embodiments of the disclosure;
FIG. 10 illustrates a table showing 4D data re-arrangement performance comparison of a conventional direct load-store approach and a proposed data re-arrangement approach according to some embodiments of the disclosure;
FIG. 11 illustrates an example flowchart of an N dimension (N-D) data re-arrangement procedure according to some embodiments of the disclosure;
FIG. 12 illustrates an example flowchart of a 2D data transpose procedure according to some embodiments of the disclosure;
FIG. 13 is a block diagram illustrating components, according to some example embodiments, able to read instructions from a machine-readable or computer-readable medium and perform any one or more of the methodologies discussed herein;
FIG. 14 is a block diagram of an example processor platform in accordance with some embodiments of the disclosure.
DETAILED DESCRIPTION
Various aspects of the illustrative embodiments will be described using terms commonly employed by those skilled in the art to convey the substance of the disclosure to others skilled in the art. However, it will be apparent to those skilled in the art that many alternate embodiments may be practiced using portions of the described aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative embodiments. However, it will be apparent to those skilled in the art that alternate embodiments may be practiced without the specific details. In other instances, well-known features may have been omitted or simplified in order to avoid obscuring the illustrative embodiments.
Further, various operations will be described as multiple discrete operations,  in turn, in a manner that is most helpful in understanding the illustrative embodiments; however, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation.
Data re-arrangement is widely used in high performance computing, machine learning and deep learning areas. Typical use cases of data re-arrangement may include matrix transpose in numerical computation, numpy transpose in machine learning and tensor permutation in deep learning.
A major performance issue in data re-arrangement is un-coalesced reading or writing. For example, in a 2D matrix transpose operation, whether from row to column or from column to row, un-coalesced access may happen in either reading or writing. Furthermore, a high dimension data re-arrangement operation may be more complicated than the 2D matrix transpose operation, which may lead to un-coalesced memory accesses and long-distance memory footprints and result in a severe performance degradation. For highly parallel, high-bandwidth hardware such as a Graphics Processing Unit (GPU), the penalty from un-coalesced memory accesses may be very heavy and the performance may drop significantly.
In modern GPUs or accelerators, a single instruction multiple threads (SIMT) mode is a general instruction execution model for better flexibility in general purpose computation, and a single instruction multiple data (SIMD) mode is an instruction execution model designed for parallel data read and write. Both SIMT and SIMD have been widely supported at the instruction level, and programmers have the ability to select an appropriate instruction execution model for use in actual applications.
For a data transpose operation, a conventional approach may use multiple threads to execute a data loading instruction and a data storing instruction; this approach is based on SIMT data loading and storing and is called a direct transpose approach herein. FIG. 1 illustrates an example direct transpose approach for a 4×4 matrix based on SIMT data loading and storing.
As shown in FIG. 1, four threads may be utilized to read data elements in different rows of the matrix and then write the data elements into one row of a transposed matrix. After four rounds of reading and writing, all data elements in the 4×4 matrix are transposed. From the illustration of FIG. 1, it can be seen that the data reading is not coalescing. For example, in step 1, thread 0 is utilized to read the data element a0,0, thread 1 is utilized to read the data element a1,0, thread 2 is utilized to read the data element a2,0, and thread 3 is utilized to read the data element a3,0. In this example, the matrix is stored in a row-major order in the memory, that is, the memory coalescing direction is along the row direction. The data elements a0,0, a1,0, a2,0 and a3,0 are stored in different rows and correspond to discontinuous memory addresses. Therefore, the reading of these data elements by the four threads in step 1 is not coalescing. Similarly, the data reading in steps 2 to 4 is not coalescing either. In a general sense, it can be easily understood that coalescing read/load may mean that data elements continuously stored in the memory are read/loaded by one or more threads in a thread group implementing a read/load operation, and similarly coalescing write/store may mean that data elements are written/stored into continuous memory addresses in the memory by one or more threads in a thread group implementing a write/store operation.
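To make the un-coalesced access pattern concrete, the following minimal sketch (a plain-Python model of the address arithmetic, not part of the patent) prints the linear addresses touched by the four threads in each step of FIG. 1, under the row-major assumption stated above:

    # Model of the direct SIMT transpose in FIG. 1 for a row-major 4x4 matrix.
    # Element (r, c) lives at linear address r * N + c.
    N = 4
    for step in range(N):  # step k: read column k, write row k of the output
        read_addrs = [t * N + step for t in range(N)]   # thread t reads a[t, step]
        write_addrs = [step * N + t for t in range(N)]  # thread t writes b[step, t]
        print(f"step {step}: reads {read_addrs} (stride {N}, un-coalesced); "
              f"writes {write_addrs} (stride 1, coalesced)")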
An improved approach for data transpose is to leverage the SIMT mode with vector load and store and then use a shared local memory (SLM) or a register file for 2D transpose. FIG. 2 illustrates an example improved transpose approach for a 4×4 matrix based on SLM. As shown in FIG. 2, with an improved SIMT mode, four threads may be utilized to load data elements in the matrix from a global memory to the SLM and each thread may read four consecutive data elements each time, so the data reading by the four threads is coalescing. Then, the data elements need to be exchanged in the SLM to meet transpose requirements. Finally, the data elements may be written into the global memory by use of the SIMT mode with vector store. In other words, each thread may write four consecutive data elements into the global memory and thus four threads may write all the data elements in the transposed matrix into the global memory.
The transpose approach with register file is very similar to the transpose approach with SLM. Specifically, the data elements in the matrix may be loaded into the register file and then shuffled inside the thread group, and when all the data elements are moved into correct locations in the register file to meet the transpose requirements, the data elements may be written into the global memory in a coalescing way.
With the transpose approach with SLM or register file, the data reading and the data writing may be coalescing. However, the transpose approach with SLM or  register file may need extra SLM space or more movement instructions. More importantly, the transpose approach with SLM or register file may not work for re-arrangement of high dimension data. Because of dispersed data distribution of the high dimension data, the transpose approach with SLM or register file cannot load expected data elements into the SLM or the register file in a coalescing way.
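The SLM-based flow can likewise be modeled in a few lines. In the sketch below (an illustrative plain-Python model, with a numpy array standing in for the SLM), the loads and stores are both coalescing, but an extra in-SLM exchange step is required, which is exactly the overhead the approach proposed below avoids:

    import numpy as np

    a = np.arange(16).reshape(4, 4)   # row-major 4x4 input in "global memory"
    slm = np.empty_like(a)
    for t in range(4):                # SIMT vector load: thread t reads row t (coalesced)
        slm[t, :] = a[t, :]
    slm = slm.T.copy()                # data exchange inside the SLM: the extra step
    b = np.empty_like(a)
    for t in range(4):                # SIMT vector store: thread t writes row t (coalesced)
        b[t, :] = slm[t, :]
    assert np.array_equal(b, a.T)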
According to some embodiments of the disclosure, a mixed SIMT and SIMD execution mode is proposed to implement 2D data transpose and high dimension data re-arrangement. The mixed SIMT and SIMD execution mode may be firstly applied to realize a 2D matrix transpose from an original matrix M(d0, d1) to a transposed matrix M(d1, d0), and then extended to realize re-arrangement of N-D data (e.g. an N-D tensor). For the N-D tensor, the re-arrangement may mean switching the dimensions (also referred to as axes) of the N-D tensor, e.g., from an original 3D tensor M(d0, d1, d2) to a target 3D tensor M(d1, d0, d2). Obviously, for the N-D data, there may be N! dimension permutations (also referred to as axes permutations).
Firstly, the mixed SIMT and SIMD execution mode may be applied to implement a 2D transpose operation on a basic block. The basic block may be a 2D matrix whose size depends on a vector length supported by the hardware (e.g. GPU) for executing the 2D transpose operation. For example, the vector length supported by the hardware may be 4, 8, 16 or 32, and accordingly, the basic block may be a 4×4 matrix, an 8×8 matrix, a 16×16 matrix or a 32×32 matrix. The 4×4 matrix may be taken as an example of the basic block and the proposed transpose approach for the 4×4 matrix will be described in detail below.
FIG. 3 illustrates an example transpose approach for a 4×4 matrix based on a combination of SIMT memory coalescing load and SIMD memory coalescing store according to some embodiments of the disclosure. As shown in FIG. 3, the transpose of the 4×4 matrix may be realized by a SIMT memory coalescing load followed by a SIMD memory coalescing store with four threads. Specifically, the four threads may load the respective four columns of data elements from the global memory with a SIMT load mode, and a corresponding assembling code may be lsc_load.ugm (M1, 4), where 4 represents the number of threads in a thread group. It is clear that the data elements in the matrix are loaded from the global memory in a SIMT memory coalescing load mode. After the data loading, each thread has one column of four data elements. Then the instruction execution model may be switched to a SIMD memory coalescing store mode. In the SIMD memory coalescing store mode, each thread can store one column of four data elements into the global memory at once, and a corresponding assembling code may be lsc_store.ugm (M1_NM, 1). Thus the four columns of data elements can be written into the global memory once with the four threads.
As illustrated by FIG. 3, the 2D transpose operation on the 4×4 matrix can be completed smartly by a combination of SIMT memory coalescing load and SIMD memory coalescing store with a thread group including four threads. Likewise, it may be easily understood that the 2D transpose operation on the 4×4 matrix can also be completed by a combination of SIMD memory coalescing load and SIMT memory coalescing store with a thread group including four threads.
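The register-level effect of the mixed mode can be modeled as follows (a plain-Python sketch of the access pattern, not the actual lsc_load.ugm/lsc_store.ugm semantics): each of the four SIMT loads is coalesced because consecutive threads read consecutive addresses, after which thread t holds column t, and the SIMD store writes each thread's four elements to consecutive addresses:

    import numpy as np

    a = np.arange(16).reshape(4, 4)     # row-major 4x4 input
    regs = [[0] * 4 for _ in range(4)]  # regs[t]: register file of thread t

    # SIMT memory coalescing load: in load i, thread t reads a[i, t]; the four
    # threads touch four consecutive addresses, so every load is coalesced.
    for i in range(4):
        for t in range(4):
            regs[t][i] = a[i, t]        # thread t accumulates column t

    # SIMD memory coalescing store: each thread writes its 4-element vector to
    # four consecutive addresses, so thread t stores column t as output row t.
    b = np.empty_like(a)
    for t in range(4):
        b[t, :] = regs[t]

    assert np.array_equal(b, a.T)       # the 4x4 block is transposed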
Furthermore, according to some embodiments of the disclosure, the mixed SIMT and SIMD execution mode may be applied to implement a 2D transpose operation on a more generalized N×M matrix. Although the basic block may be designed very well and its transpose may be completed efficiently with the mixed SIMT and SIMD execution mode, the size of the basic block depends on the hardware-supported vector length, e.g. 4, 8, 16 or 32. In other words, the transpose approach shown in FIG. 3 can only be used to implement the transpose of a 2D matrix having a size limited by the hardware-supported vector length. In practical applications, it may be more desirable to apply the mixed SIMT and SIMD execution mode to complete a transpose operation on a more generalized N×M matrix. It can be easily understood that a transpose operation of an N×M matrix can be implemented based on a transpose operation of an N×N matrix or an M×M matrix. Therefore, for ease of description, an N×N matrix will be taken as an example to illustrate the procedure of the proposed transpose approach.
FIG. 4 illustrates an example transpose approach for an N×N matrix based on a combination of SIMT memory coalescing load and SIMD memory coalescing store according to some embodiments of the disclosure. As illustrated in FIG. 4, the transpose approach for the N×N matrix may include splitting the N×N matrix into one or more basic blocks along a memory coalescing direction, performing a transpose of each basic block based on the combination of SIMT memory coalescing load and SIMD memory coalescing store or the combination of SIMD memory coalescing load and SIMT memory coalescing store, and switching transposed basic blocks present in a diagonal direction of the N×N matrix according to a transpose direction of the transpose operation. The size of the basic block may be determined based on the hardware-supported vector length. For example, assuming the hardware-supported vector length is denoted as VECL, the basic block may be a VECL×VECL matrix, and a thread group including a number VECL of threads may be used to handle the transpose of the basic block.
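A minimal sketch of this block-wise scheme follows (plain numpy; block_transpose is an illustrative name, not from the patent). Writing the transposed block (bi, bj) to position (bj, bi) folds the per-block transpose and the diagonal block swap into a single indexed write:

    import numpy as np

    def block_transpose(a, vecl):
        # Transpose an N x N matrix by transposing VECL x VECL basic blocks;
        # placing block (bi, bj) at (bj, bi) performs the swap of transposed
        # blocks across the diagonal in the same step.
        n = a.shape[0]
        assert a.shape == (n, n) and n % vecl == 0
        out = np.empty_like(a)
        for bi in range(0, n, vecl):
            for bj in range(0, n, vecl):
                out[bj:bj + vecl, bi:bi + vecl] = a[bi:bi + vecl, bj:bj + vecl].T
        return out

    a = np.arange(64).reshape(8, 8)
    assert np.array_equal(block_transpose(a, 4), a.T)  # one thread group per block on HW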
As discussed above, with the mixed SIMT and SIMD execution mode, the proposed transpose approach can be implemented by memory coalescing load/store operations without using extra SLM or register file and corresponding SLM load/store or movement instructions. To better understand the advantages of the proposed transpose approach, a performance comparison of different transpose approaches is illustrated in FIG. 5.
FIG. 5 illustrates a histogram showing a performance comparison of different transpose approaches for 2D matrix transpose. The results are shown as achieved bandwidths, where a higher bandwidth indicates better performance. It is obvious that the performance of the proposed transpose approach based on the mixed SIMT and SIMD execution mode is much better than that of the direct transpose approach and the improved transpose approach based on SLM for 1k, 2k, 4k, 8k and 16k inputs.
According to some embodiments of the disclosure, the proposed transpose approach as described with reference to the foregoing drawings may be extended to implement N-D data re-arrangement. For example, for an N-D tensor, the re-arrangement may mean switching the dimensions of the N-D tensor, e.g., from an original 3D tensor M(d0, d1, d2) to a target 3D tensor M(d1, d0, d2). Obviously, for the N-D data, there may be N! dimension permutations. Thus for the 3D tensor, there may be 3! = 6 dimension permutations, i.e. M(d0, d1, d2), M(d1, d0, d2), M(d2, d0, d1), M(d2, d1, d0), M(d1, d2, d0) and M(d0, d2, d1).
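For instance, the 3! = 6 permutations of a 3D tensor can be enumerated directly (a trivial check using Python's standard library):

    from itertools import permutations

    perms = list(permutations(range(3)))  # axis orders of a 3D tensor
    print(len(perms))                     # 6, i.e. 3!
    print(perms)  # (0,1,2), (0,2,1), (1,0,2), (1,2,0), (2,0,1), (2,1,0)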
For a re-arrangement operation on an N-D tensor, the re-arrangement operation may be identified as a combination of a 2D data transpose operation or a data movement operation on the N-D tensor, i.e. a pure data movement operation, a pure 2D data transpose operation, or a combination of a 2D data transpose operation and a data movement operation. When N is equal to 2, the N-D data re-arrangement operation is actually a 2D data transpose operation, which can be implemented with the proposed transpose approach shown in FIG. 4. Thus the following embodiments illustrate various situations of re-arrangement operations on three or higher dimension data.
According to some embodiments, in a first special situation, taking a re-arrangement operation of an N-D tensor as an example, if a data coalescing dimension of the N-D tensor does not change in the re-arrangement operation, the re-arrangement operation may be identified as a pure data movement operation. It is noted that the data coalescing dimension herein may indicate a dimension in which corresponding data elements are contiguously stored in the memory. Typically, a data coalescing dimension in a dimension permutation (d0, d1, …, dN-1) of an N-D tensor may be a last dimension dN-1 in the dimension permutation when the N-D tensor is stored in the memory in a row-major order and may be a first dimension d0 in the dimension permutation when the N-D tensor is stored in the memory in a column-major order.
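This can be checked with numpy strides (an illustrative aside, not part of the patent): in a row-major (C-order) tensor the last axis has the smallest stride, while in a column-major (Fortran-order) tensor the first axis does:

    import numpy as np

    c = np.zeros((2, 3, 2))             # row-major (C-order) by default
    print(c.strides)                    # (48, 16, 8): d2 has the smallest stride,
                                        # so d2 is the data coalescing dimension
    f = np.zeros((2, 3, 2), order="F")  # column-major storage
    print(f.strides)                    # (8, 16, 48): d0 is now the coalescing dimension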
FIG. 6 illustrates an example 3D data re-arrangement operation according to some embodiments of the disclosure. As shown in FIG. 6, a 3D tensor may have an original dimension permutation (d0, d1, d2) and a shape (2, 3, 2) and need to be re-arranged to have a target dimension permutation (d1, d0, d2) and a shape (3, 2, 2). The last dimensions in the original dimension permutation (d0, d1, d2) and the target dimension permutation (d1, d0, d2) are both d2. That is, the data coalescing dimension of the 3D tensor does not change in the re-arrangement operation. Therefore, only a data movement operation is needed to realize the re-arrangement operation of the 3D tensor from the original dimension permutation (d0, d1, d2) to the target dimension permutation (d1, d0, d2). This situation is the simplest one, in which the re-arrangement operation on the 3D tensor can be easily completed by a data coalescing movement operation.
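The claim that this case needs only data movement can be verified in numpy (illustrative): every contiguous length-d2 row of the input reappears intact, just at a different offset, in the output:

    import numpy as np

    t = np.arange(12).reshape(2, 3, 2)     # (d0, d1, d2), row-major
    r = np.transpose(t, (1, 0, 2)).copy()  # target permutation (d1, d0, d2)

    # d2 stays the coalescing dimension, so each contiguous length-d2 row
    # moves as an unbroken unit: a pure (coalescing) data movement.
    rows_in = {tuple(row) for row in t.reshape(-1, 2)}
    rows_out = {tuple(row) for row in r.reshape(-1, 2)}
    assert rows_in == rows_out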
According to some embodiments, in a second special situation, the re-arrangement operation of the N-D data may be identified as a pure 2D transpose operation. FIG. 7 illustrates an example 3D data re-arrangement operation according to some embodiments of the disclosure. As shown in FIG. 7, a 3D tensor may have an original dimension permutation (d0, d1, d2) and a shape (2, 3, 2) and need to be re-arranged to have a target dimension permutation (d2, d0, d1) and a shape (2, 2, 3). It can be seen from FIG. 7 that the re-arrangement operation may be transformed into a 2D transpose operation on a 2D matrix of (d0×d1) by d2.
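This equivalence is easy to verify in numpy (an illustrative check using the shape from FIG. 7):

    import numpy as np

    d0, d1, d2 = 2, 3, 2
    t = np.arange(d0 * d1 * d2).reshape(d0, d1, d2)

    # (d0, d1, d2) -> (d2, d0, d1) equals a plain 2D transpose of the
    # (d0*d1) x d2 matrix that the tensor occupies in row-major memory.
    via_permute = np.transpose(t, (2, 0, 1))
    via_2d_transpose = t.reshape(d0 * d1, d2).T.reshape(d2, d0, d1)
    assert np.array_equal(via_permute, via_2d_transpose)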
In the disclosure, it is noted that in a dimension permutation (d0, d1, …, dN-1), a subscript of a dimension may be used for identifying an index of the dimension, and a numerical value of the dimension may be used for identifying the number of data elements in the dimension. For example, for a 3D tensor with a dimension permutation (d0, d1, d2) and a shape (2, 3, 2), there are 2 data elements in dimension 0, 3 data elements in dimension 1, and 2 data elements in dimension 2, so the numerical value d0 of dimension 0 is 2, the numerical value d1 of dimension 1 is 3, and the numerical value d2 of dimension 2 is 2. Therefore, in FIG. 7, the re-arrangement operation may be transformed into the 2D transpose operation on the 2D matrix of (2×3) by 2. The definition of the dimension permutation may be applied to various embodiments in the disclosure.
More generally, under a condition that an original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) and a target dimension permutation of the N-D data after the re-arrangement operation is (dI, …, dN-1, d0, …, dI-1), the re-arrangement operation of the N-D data may be identified as a pure 2D transpose operation on a 2D matrix of (d0×…×dI-1) by (dI×…×dN-1) corresponding to the N-D data, where I is an integer between 1 and N-1. For example, a re-arrangement operation of a 5D tensor from an original dimension permutation (a, b, c, d, e) to a target dimension permutation (c, d, e, a, b) is equivalent to a 2D transpose operation on a 2D matrix of (a×b) by (c×d×e).
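A minimal host-side sketch of this second check, under the same encoding as above (the names are illustrative assumptions), detects whether the target permutation is a rotation of the original and, if so, derives the shape of the equivalent 2D matrix:

// Hedged sketch: detect the pure-2D-transpose case, i.e. the target
// permutation is a rotation (dI, ..., dN-1, d0, ..., dI-1) of the original.
// Returns the split point I in [1, N-1], or -1 if no rotation matches.
#include <utility>
#include <vector>

int rotation_point(const std::vector<int>& perm) {
    const int n = static_cast<int>(perm.size());
    for (int I = 1; I < n; ++I) {
        bool match = true;
        for (int k = 0; k < n && match; ++k)
            match = (perm[k] == (I + k) % n);
        if (match) return I;
    }
    return -1;
}

// The equivalent 2D transpose acts on a matrix of (d0 x ... x dI-1) rows by
// (dI x ... x dN-1) columns; shape[k] is the extent of dimension dk.
std::pair<long, long> matrix_shape(const std::vector<long>& shape, int I) {
    long rows = 1, cols = 1;
    for (int k = 0; k < static_cast<int>(shape.size()); ++k) {
        if (k < I) rows *= shape[k];
        else       cols *= shape[k];
    }
    return {rows, cols};
}

For the 5D example above, perm = {2, 3, 4, 0, 1} gives I = 2, and matrix_shape returns rows = a×b and cols = c×d×e.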
For the re-arrangement operation of the N-D data, besides the above two special situations, in a more general situation the dimension permutation of the N-D data may change in an arbitrary manner. That is, the data coalescing dimension may change, and the order of dimensions in the dimension permutation may change irregularly. In this situation, if a conventional direct load-store approach is used to implement the re-arrangement operation, the performance will drop significantly because the memory accesses will be random.
According to some embodiments, for the more general situation, it is proposed to implement the re-arrangement operation by a combination of a 2D transpose operation and a data movement operation. Specifically, the re-arrangement operation  on the N-D data may be transformed into a combination of a 2D transpose operation on a 2D matrix corresponding to the N-D data and a data movement operation on a transposed matrix corresponding to the 2D matrix. The 2D matrix may be determined based on an original dimension permutation and a target dimension permutation associated with the re-arrangement operation.
FIG. 8 illustrates an example 3D data re-arrangement operation according to some embodiments of the disclosure. As shown in FIG. 8, a 3D tensor may have an original dimension permutation (d2, d0, d1) and a shape (2, 2, 3) and need to be re-arranged to have a target dimension permutation (d1, d0, d2) and a shape (3, 2, 2). It can be seen from FIG. 8 that the last dimension (i.e., the data coalescing dimension) changes in the re-arrangement operation, and the re-arrangement operation from (d2, d0, d1) to (d1, d0, d2) does not belong to the above-described second special situation. Therefore, the re-arrangement operation needs to be transformed into a combination of a 2D transpose operation and a data movement operation.
From the illustration of FIG. 8, it is observed that in the target dimension permutation, the last dimension d2 is the data coalescing dimension, that is, the data elements are contiguous in the dimension d2. Therefore, a 2D transpose operation on a 2D matrix of d2 by (d0×d1) corresponding to the 3D tensor (d2, d0, d1) may first be performed to obtain a transposed 2D matrix of (d0×d1) by d2, which corresponds to the 3D tensor (d0, d1, d2). Then, from the 3D tensor (d0, d1, d2) to the 3D tensor (d1, d0, d2), only a data movement operation is needed. Therefore, the re-arrangement operation from the 3D tensor (d2, d0, d1) to the 3D tensor (d1, d0, d2) may be completed by the combination of the 2D transpose operation on the 2D matrix of d2 by (d0×d1) and the data movement operation on the transposed 2D matrix of (d0×d1) by d2.
More generally, under a condition that the data coalescing dimension of the N-D data changes in the re-arrangement operation and the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) but the target dimension permutation of the N-D data after the re-arrangement operation is not (dI, …, dN-1, d0, …, dI-1), the re-arrangement operation of the N-D data may be identified as a combination of a 2D transpose operation on a 2D matrix of (d0×…×dM) by (dM+1×…×dN-1) corresponding to the N-D data and a data movement operation on a transposed 2D matrix of (dM+1×…×dN-1) by (d0×…×dM), where dM is a data coalescing dimension in the target dimension permutation of the N-D data, and M is an integer between 0 and N-2.
Taking a 4D tensor as an example, an original dimension permutation of the 4D tensor is (d0, d1, d2, d3) and a target dimension permutation of the 4D tensor is (d3, d1, d0, d2). In the target dimension permutation, the data coalescing dimension is d2, so a 2D transpose operation from a 2D matrix of (d0×d1×d2) by d3 to a 2D matrix of d3 by (d0×d1×d2) may first be performed to obtain a 4D tensor (d3, d0, d1, d2), and then the 4D tensor with the target dimension permutation (d3, d1, d0, d2) may be obtained by a data movement operation from the 4D tensor (d3, d0, d1, d2) to the 4D tensor (d3, d1, d0, d2) with coalescing load and store.
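The following is a hedged CPU functional model of this 4D example (a row-major correctness reference with illustrative names; it is not the coalesced GPU implementation of the disclosure). Stage 1 transposes the (d0×d1×d2) by d3 view, and stage 2 performs the data movement that swaps the two middle dimensions while copying contiguous d2 runs:

#include <vector>

// Stage 1: 2D transpose of a (rows x cols) row-major view, here
// (d0*d1*d2) x d3 -> d3 x (d0*d1*d2), yielding the layout (d3, d0, d1, d2).
void transpose2d(const std::vector<float>& src, std::vector<float>& dst,
                 long rows, long cols) {
    for (long r = 0; r < rows; ++r)
        for (long c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
}

// Stage 2: data movement (d3, d0, d1, d2) -> (d3, d1, d0, d2), where ek is
// the extent of dimension dk; the innermost loop copies a contiguous d2 run,
// so loads and stores both remain coalescing-friendly.
void move_swap_middle(const std::vector<float>& src, std::vector<float>& dst,
                      long e3, long e0, long e1, long e2) {
    for (long i3 = 0; i3 < e3; ++i3)
        for (long i0 = 0; i0 < e0; ++i0)
            for (long i1 = 0; i1 < e1; ++i1)
                for (long i2 = 0; i2 < e2; ++i2)
                    dst[((i3 * e1 + i1) * e0 + i0) * e2 + i2] =
                        src[((i3 * e0 + i0) * e1 + i1) * e2 + i2];
}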
As described above, a re-arrangement operation on N-D data may be identified as a combination of a 2D data transpose operation or a data movement operation on the N-D data. The data movement operation can be simply realized by memory coalescing load and store. In addition, with the proposed transpose approach according to the embodiments of the disclosure, the 2D data transpose operation can be realized by a combination of SIMT memory coalescing load and SIMD memory coalescing store or a combination of SIMD memory coalescing load and SIMT memory coalescing store. Therefore, according to the embodiments of the disclosure, the re-arrangement operation on the N-D data can be completed entirely by memory coalescing load and store, and thus the performance of the proposed data re-arrangement approach may be much better than that of the conventional direct load-store approach.
FIG. 9 illustrates a histogram showing a 3D data re-arrangement performance comparison of a conventional direct load-store approach and the proposed data re-arrangement approach according to some embodiments of the disclosure. As shown in FIG. 9, for a 3D tensor, the first re-arrangement operation is from an original dimension permutation (d0, d1, d2) to a target dimension permutation (d1, d0, d2), which is actually a pure data movement operation, so the performance of the proposed data re-arrangement approach is similar to that of the conventional direct load-store approach. For the other re-arrangement operations, however, the performance of the proposed data re-arrangement approach is significantly better.
FIG. 10 illustrates a table showing a 4D data re-arrangement performance comparison of a conventional direct load-store approach and the proposed data re-arrangement approach according to some embodiments of the disclosure. As shown in FIG. 10, a 4D tensor of the kind commonly used in deep learning is taken as an example of N-D data to test the performance of the proposed data re-arrangement approach. For the 4D tensor, there are 24 possible dimension permutations, so the table in FIG. 10 shows the performance of the re-arrangement operations from an original dimension permutation (d0, d1, d2, d3) to all possible target dimension permutations.
In the table of FIG. 10, for the re-arrangement operations 1 to 5, only a data movement operation is needed to complete a corresponding re-arrangement operation, so the performance of the proposed approach is the same as that of the conventional direct load-store approach, i.e., the speedup is 1.0x. For the re-arrangement operations 6 to 11, which can be identified as pure 2D transpose operations, the speedup ranges from 1.9x to 7.7x. For the re-arrangement operations 12 to 23, each re-arrangement operation can be identified as a combination of a 2D transpose operation and a data movement operation, so the overall performance depends on the performance of both the transpose operation and the movement operation. From the table, it can be seen that the overall performance of the proposed approach for the re-arrangement operations 12 to 23 is also much better than that of the conventional direct load-store approach, with the highest speedup reaching 8.7x.
For a better understanding of the overall solution for N-D data re-arrangement proposed in the disclosure, the proposed data re-arrangement approach will be further described with reference to the flowcharts shown in FIG. 11 and FIG. 12.
FIG. 11 illustrates an example flowchart of an N-D data re-arrangement procedure according to some embodiments of the disclosure. The N-D data re-arrangement procedure may be implemented by processor circuitry and may include operations 1110 to 1130.
At operation 1110, the processor circuitry may identify a re-arrangement operation on N-D data as a combination of a 2D data transpose operation or a data movement operation on the N-D data, where N is an integer greater than 1.
At operation 1120, under a condition that the re-arrangement operation includes the 2D data transpose operation, the processor circuitry may perform the 2D data transpose operation based on a combination of SIMT memory coalescing load and SIMD memory coalescing store or a combination of SIMD memory coalescing load and SIMT memory coalescing store.
At operation 1130, under a condition that the re-arrangement operation includes the data movement operation, the processor circuitry may perform the data movement operation based on direct memory coalescing load and store.
According to some embodiments, under a condition that a data coalescing dimension of the N-D data does not change in the re-arrangement operation, the combination of the 2D data transpose operation or the data movement operation may include only the data movement operation.
According to some embodiments, under a condition that N is equal to 2, the combination of the 2D data transpose operation or the data movement operation may include only the 2D data transpose operation.
According to some embodiments, the processor circuitry may be configured to perform the 2D data transpose operation by: determining a 2D matrix corresponding to the N-D data based on an original dimension permutation and a target dimension permutation associated with the re-arrangement operation; and performing the 2D data transpose operation on the 2D matrix. The details about performing the 2D data transpose operation will be described below with reference to FIG. 12.
According to some embodiments, under a condition that the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) and the target dimension permutation of the N-D data after the re-arrangement operation is (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation may include only the 2D data transpose operation, where N is greater than 2 and I is an integer between 1 and N-1. The 2D matrix may be a 2D matrix of (d0×…×dI-1) by (dI×…×dN-1) corresponding to the N-D data.
According to some embodiments, under a condition that a data coalescing dimension of the N-D data changes in the re-arrangement operation and the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) but the target dimension permutation of the N-D data after the re-arrangement operation is not (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation may include both the 2D data transpose operation and the data movement operation, where N is greater than 2 and I is an integer between 1 and N-1. The 2D matrix may be a 2D matrix of (d0×…×dM) by (dM+1×…×dN-1) corresponding to the N-D data, where dM is a data coalescing dimension in the target dimension permutation of the N-D data, and M is an integer between 0 and N-2.
According to some embodiments, a data coalescing dimension in a dimension permutation (d0, d1, …, dN-1) of the N-D data may be a last dimension dN-1 in the dimension permutation when the N-D data is stored in a row-major order and may be a first dimension d0 in the dimension permutation when the N-D data is stored in a column-major order.
FIG. 12 illustrates an example flowchart of a 2D data transpose procedure according to some embodiments of the disclosure. The 2D data transpose procedure may be implemented by the processor circuitry and may include operations 1210 to 1230.
At operation 1210, the processor circuitry may split the 2D matrix into one or more basic blocks along a memory coalescing direction.
According to some embodiments, a size of the basic block may be determined based on a vector length supported by the processor circuitry.
At operation 1220, the processor circuitry may perform a transpose of each basic block based on the combination of SIMT memory coalescing load and SIMD memory coalescing store or the combination of SIMD memory coalescing load and SIMT memory coalescing store.
According to some embodiments, performing the transpose of each basic block may include: loading all elements of the basic block with a plurality of threads from a global memory based on the SIMT memory coalescing load, and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMD memory coalescing store. Here, the number of the threads may be determined based on the vector length.
According to some embodiments, performing the transpose of each basic block may include: loading all elements of the basic block with a plurality of threads from a global memory based on the SIMD memory coalescing load, and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMT memory coalescing store. Also, the number of the threads may be determined based on the vector length.
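As a hedged CUDA-style sketch of the first combination in operation 1220 (SIMT memory coalescing load followed by an SIMD vectorized store), assuming a vector length of 4, row-major storage, 16-byte-aligned output rows, and illustrative names that are not taken from the disclosure, one basic block may be transposed by four threads as follows:

// Hedged sketch: transpose one 4x4 basic block. Four threads perform an
// SIMT coalescing load (consecutive threads read consecutive addresses in
// each row), then each thread emits one SIMD (float4) coalescing store.
// 'in'/'out' point at the block's top-left element; strides are in floats.
__global__ void transpose_block_4x4(const float* __restrict__ in,
                                    float* __restrict__ out,
                                    int in_stride, int out_stride) {
    int t = threadIdx.x;  // t = 0..3, one thread per column of the block

    // SIMT coalescing load: each row access is one coalesced transaction.
    float a0 = in[0 * in_stride + t];
    float a1 = in[1 * in_stride + t];
    float a2 = in[2 * in_stride + t];
    float a3 = in[3 * in_stride + t];

    // Thread t now holds column t of the source block, i.e. row t of the
    // transposed block; write it back as one vectorized store (requires
    // out + t * out_stride to be 16-byte aligned).
    reinterpret_cast<float4*>(out + t * out_stride)[0] =
        make_float4(a0, a1, a2, a3);
}

The converse combination would instead load row t with one float4 per thread (SIMD coalescing load) and let the four threads store each output row together (SIMT coalescing store).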
At operation 1230, the processor circuitry may switch transposed basic blocks present in a diagonal direction of the 2D matrix according to a transpose direction of the 2D transpose operation.
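A hedged host-side sketch tying operations 1210 and 1230 to the kernel above (again with illustrative names, and assuming R and C are multiples of 4): the R by C matrix is split into 4×4 basic blocks, and the transpose of block (bi, bj) is written to block (bj, bi) of the C by R output, which realizes the diagonal switch. Launching one tiny grid per block is for clarity only; a practical implementation would batch many basic blocks into a single launch.

// Hedged sketch: split an R x C row-major matrix into 4x4 basic blocks and
// place each transposed block at its diagonally switched position.
void launch_transpose(const float* d_in, float* d_out, int R, int C) {
    for (int bi = 0; bi < R / 4; ++bi) {
        for (int bj = 0; bj < C / 4; ++bj) {
            transpose_block_4x4<<<1, 4>>>(
                d_in + (bi * 4) * C + (bj * 4),   // source block (bi, bj)
                d_out + (bj * 4) * R + (bi * 4),  // destination block (bj, bi)
                C, R);
        }
    }
}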
FIG. 13 is a block diagram illustrating components, according to some example embodiments, able to read instructions from a machine-readable or computer-readable medium (e.g., a non-transitory machine-readable storage medium) and perform any one or more of the methodologies discussed herein. Specifically, FIG. 13 shows a diagrammatic representation of hardware resources 1300 including one or more processors (or processor cores) 1310, one or more memory/storage devices 1320, and one or more communication resources 1330, each of which may be communicatively coupled via a bus 1340. For embodiments where node virtualization (e.g., NFV) is utilized, a hypervisor 1302 may be executed to provide an execution environment for one or more network slices/sub-slices to utilize the hardware resources 1300.
The processors 1310 may include, for example, a processor 1312 and a processor 1314, which may be, e.g., a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a visual processing unit (VPU), a field programmable gate array (FPGA), or any suitable combination thereof.
The memory/storage devices 1320 may include main memory, disk storage, or any suitable combination thereof. The memory/storage devices 1320 may include, but are not limited to, any type of volatile or non-volatile memory such as dynamic random access memory (DRAM), static random-access memory (SRAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), Flash memory, solid-state storage, etc.
The communication resources 1330 may include interconnection or network interface components or other suitable devices to communicate with one or more peripheral devices 1304 or one or more databases 1306 via a network 1308. For example, the communication resources 1330 may include wired communication components (e.g., for coupling via a Universal Serial Bus (USB)), cellular communication components, NFC components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components.
Instructions 1350 may comprise software, a program, an application, an applet, an app, or other executable code for causing at least any of the processors 1310 to perform any one or more of the methodologies discussed herein. The instructions 1350 may reside, completely or partially, within at least one of the processors 1310 (e.g., within the processor’s cache memory), the memory/storage devices 1320, or any suitable combination thereof. Furthermore, any portion of the instructions 1350 may be transferred to the hardware resources 1300 from any combination of the peripheral devices 1304 or the databases 1306. Accordingly, the memory of processors 1310, the memory/storage devices 1320, the peripheral devices 1304, and the databases 1306 are examples of computer-readable and machine-readable media.
FIG. 14 is a block diagram of an example processor platform in accordance with some embodiments of the disclosure. The processor platform 1400 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, a headset or other wearable device, or any other type of computing device.
The processor platform 1400 of the illustrated example includes a processor 1412. The processor 1412 of the illustrated example is hardware. For example, the processor 1412 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In some embodiments, the processor implements one or more of the methods or processes described above.
The processor 1412 of the illustrated example includes a local memory 1413 (e.g., a cache). The processor 1412 of the illustrated example is in communication with a main memory including a volatile memory 1414 and a non-volatile memory 1416 via a bus 1418. The volatile memory 1414 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of random access memory device. The non-volatile memory 1416 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1414, 1416 is controlled by a memory controller.
The processor platform 1400 of the illustrated example also includes interface circuitry 1420. The interface circuitry 1420 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), a Bluetooth® interface, a near field communication (NFC) interface, and/or a PCI express interface.
In the illustrated example, one or more input devices 1422 are connected to the interface circuitry 1420. The input device(s) 1422 permit(s) a user to enter data and/or commands into the processor 1412. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, and/or a voice recognition system.
One or more output devices 1424 are also connected to the interface circuitry 1420 of the illustrated example. The output devices 1424 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube display (CRT), an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or a speaker. The interface circuitry 1420 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip, and/or a graphics driver processor.
The interface circuitry 1420 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 1426. The communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, etc.
For example, the interface circuitry 1420 may receive a training dataset inputted through the input device(s) 1422 or retrieved from the network 1426.
The processor platform 1400 of the illustrated example also includes one or more mass storage devices 1428 for storing software and/or data. Examples of such mass storage devices 1428 include floppy disk drives, hard disk drives, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.
Machine executable instructions 1432 may be stored in the mass storage device 1428, in the volatile memory 1414, in the non-volatile memory 1416, and/or  on a removable non-transitory computer readable storage medium such as a CD or DVD.
Additional Notes and Examples:
Example 1 includes an apparatus for data re-arrangement, comprising: interface circuitry; and processor circuitry coupled to the interface circuitry and configured to: identify a re-arrangement operation on N dimension (N-D) data received via the interface circuitry as a combination of a two dimension (2D) data transpose operation or a data movement operation on the N-D data, where N is an integer greater than 1; perform, under a condition that the re-arrangement operation comprises the 2D data transpose operation, the 2D data transpose operation based on a combination of single instruction multiple threads (SIMT) memory coalescing load and single instruction multiple data (SIMD) memory coalescing store or a combination of SIMD memory coalescing load and SIMT memory coalescing store; and perform, under a condition that the re-arrangement operation comprises the data movement operation, the data movement operation based on direct memory coalescing load and store.
Example 2 includes the apparatus of Example 1, wherein the processor circuitry is configured to perform the 2D data transpose operation by: determining a 2D matrix corresponding to the N-D data based on an original dimension permutation and a target dimension permutation associated with the re-arrangement operation; and performing the 2D data transpose operation on the 2D matrix.
Example 3 includes the apparatus of Example 2, wherein performing the 2D data transpose operation on the 2D matrix comprises: splitting the 2D matrix into one or more basic blocks along a memory coalescing direction; performing a transpose of each basic block based on the combination of SIMT memory coalescing load and SIMD memory coalescing store or the combination of SIMD memory coalescing load and SIMT memory coalescing store; and switching transposed basic blocks present in a diagonal direction of the 2D matrix according to a transpose direction of the 2D transpose operation.
Example 4 includes the apparatus of Example 3, wherein a size of the basic block is determined based on a vector length supported by the processor circuitry.
Example 5 includes the apparatus of Example 4, wherein performing the transpose of each basic block comprises: loading all elements of the basic block with  a plurality of threads from a global memory based on the SIMT memory coalescing load; and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMD memory coalescing store, wherein a number of the threads is determined based on the vector length.
Example 6 includes the apparatus of Example 4, wherein performing the transpose of each basic block comprises: loading all elements of the basic block with a plurality of threads from a global memory based on the SIMD memory coalescing load; and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMT memory coalescing store, wherein a number of the threads is determined based on the vector length.
Example 7 includes the apparatus of any of Examples 1 to 6, wherein under a condition that a data coalescing dimension of the N-D data does not change in the re-arrangement operation, the combination of the 2D data transpose operation or the data movement operation comprises only the data movement operation.
Example 8 includes the apparatus of any of Examples 1 to 6, wherein under a condition that N is equal to 2, the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation.
Example 9 includes the apparatus of any of Examples 2 to 6, wherein under a condition that the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) and the target dimension permutation of the N-D data after the re-arrangement operation is (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation, where N is greater than 2 and I is an integer between 1 and N-1.
Example 10 includes the apparatus of Example 9, wherein the 2D matrix is a 2D matrix of (d0×…×dI-1) by (dI×…×dN-1) corresponding to the N-D data.
Example 11 includes the apparatus of any of Examples 2 to 6, wherein under a condition that a data coalescing dimension of the N-D data changes in the re-arrangement operation and the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) but the target dimension permutation of the N-D data after the re-arrangement operation is not (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises both the 2D data transpose operation and the data movement operation, where N is greater than 2 and I is an integer between 1 and N-1.
Example 12 includes the apparatus of Example 11, wherein the 2D matrix is a 2D matrix of (d0×…×dM) by (dM+1×…×dN-1) corresponding to the N-D data, where dM is a data coalescing dimension in the target dimension permutation of the N-D data, and M is an integer between 0 and N-2.
Example 13 includes the apparatus of any of Examples 1 to 12, wherein a data coalescing dimension in a dimension permutation (d0, d1, …, dN-1) of the N-D data is a last dimension dN-1 in the dimension permutation under a condition that the N-D data is stored in a row-major order and is a first dimension d0 in the dimension permutation under a condition that the N-D data is stored in a column-major order.
Example 14 includes a method for data re-arrangement, comprising: identifying a re-arrangement operation on N dimension (N-D) data as a combination of a two dimension (2D) data transpose operation or a data movement operation on the N-D data, where N is an integer greater than 1; performing, under a condition that the re-arrangement operation comprises the 2D data transpose operation, the 2D data transpose operation based on a combination of single instruction multiple threads (SIMT) memory coalescing load and single instruction multiple data (SIMD) memory coalescing store or a combination of SIMD memory coalescing load and SIMT memory coalescing store; and performing, under a condition that the re-arrangement operation comprises the data movement operation, the data movement operation based on direct memory coalescing load and store.
Example 15 includes the method of Example 14, wherein performing the 2D data transpose operation comprises: determining a 2D matrix corresponding to the N-D data based on an original dimension permutation and a target dimension permutation associated with the re-arrangement operation; and performing the 2D data transpose operation on the 2D matrix.
Example 16 includes the method of Example 15, wherein performing the 2D data transpose operation on the 2D matrix comprises: splitting the 2D matrix into one or more basic blocks along a memory coalescing direction; performing a transpose of each basic block based on the combination of SIMT memory coalescing load and SIMD memory coalescing store or the combination of SIMD memory coalescing load and SIMT memory coalescing store; and switching transposed basic blocks present in a diagonal direction of the 2D matrix according to a transpose direction of the 2D transpose operation.
Example 17 includes the method of Example 16, wherein a size of the basic block is determined based on a vector length supported by processor circuitry for implementing the method.
Example 18 includes the method of Example 17, wherein performing the transpose of each basic block comprises: loading all elements of the basic block with a plurality of threads from a global memory based on the SIMT memory coalescing load; and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMD memory coalescing store, wherein a number of the threads is determined based on the vector length.
Example 19 includes the method of Example 17, wherein performing the transpose of each basic block comprises: loading all elements of the basic block with a plurality of threads from a global memory based on the SIMD memory coalescing load; and storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMT memory coalescing store, wherein a number of the threads is determined based on the vector length.
Example 20 includes the method of any of Examples 14 to 19, wherein under a condition that a data coalescing dimension of the N-D data does not change in the re-arrangement operation, the combination of the 2D data transpose operation or the data movement operation comprises only the data movement operation.
Example 21 includes the method of any of Examples 14 to 19, wherein under a condition that N is equal to 2, the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation.
Example 22 includes the method of any of Examples 15 to 19, wherein under a condition that the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) and the target dimension permutation of the N-D data after the re-arrangement operation is (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation, where N is greater than 2 and I is an integer between 1 and N-1.
Example 23 includes the method of Example 22, wherein the 2D matrix is a 2D matrix of (d0×…×dI-1) by (dI×…×dN-1) corresponding to the N-D data.
Example 24 includes the method of any of Examples 15 to 19, wherein under a condition that a data coalescing dimension of the N-D data changes in the re-arrangement operation and the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) but the target dimension permutation of the N-D data after the re-arrangement operation is not (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises both the 2D data transpose operation and the data movement operation, where N is greater than 2 and I is an integer between 1 and N-1.
Example 25 includes the method of Example 24, wherein the 2D matrix is a 2D matrix of (d0×…×dM) by (dM+1×…×dN-1) corresponding to the N-D data, where dM is a data coalescing dimension in the target dimension permutation of the N-D data, and M is an integer between 0 and N-2.
Example 26 includes the method of any of Examples 14 to 25, wherein a data coalescing dimension in a dimension permutation (d0, d1, …, dN-1) of the N-D data is a last dimension dN-1 in the dimension permutation under a condition that the N-D data is stored in a row-major order and is a first dimension d0 in the dimension permutation under a condition that the N-D data is stored in a column-major order.
Example 27 includes a computer-readable medium having instructions stored thereon, wherein the instructions, when executed by processor circuitry, cause the processor circuitry to perform the method of any of Examples 14 to 26.
Example 28 includes a device for data re-arrangement, comprising means for performing the method of any of Examples 14 to 26.
Various techniques, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, non-transitory computer readable storage medium, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the various techniques. The non-transitory computer readable storage medium may be a computer readable storage medium that does not include a signal. In the case of program code execution on programmable computers, the computing system may include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The volatile and non-volatile memory and/or storage elements may be a RAM, EPROM, flash drive, optical drive, magnetic hard drive, solid state drive, or other medium for storing electronic data. One or more programs that may implement or utilize the various techniques described herein may use an application programming interface (API), reusable controls, and the like. Such programs may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the program(s) may be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations. Exemplary systems or devices may include, without limitation, laptop computers, tablet computers, desktop computers, smart phones, computer terminals and servers, storage databases, and other electronics which utilize circuitry and programmable memory, such as household appliances, smart televisions, digital video disc (DVD) players, heating, ventilating, and air conditioning (HVAC) controllers, light switches, and the like.
The above detailed description includes references to the accompanying drawings, which form a part of the detailed description. The drawings show, by way of illustration, specific embodiments that may be practiced. These embodiments are also referred to herein as “examples.” Such examples may include elements in addition to those shown or described. However, the present inventors also contemplate examples in which only those elements shown or described are provided. Moreover, the present inventors also contemplate examples using any combination or permutation of those elements shown or described (or one or more aspects thereof), either with respect to a particular example (or one or more aspects thereof), or with respect to other examples (or one or more aspects thereof) shown or described herein.
All publications, patents, and patent documents referred to in this document are incorporated by reference herein in their entirety, as though individually incorporated by reference. In the event of inconsistent usages between this document and those documents so incorporated by reference, the usage in the incorporated reference(s) should be considered supplementary to that of this document; for irreconcilable inconsistencies, the usage in this document controls.
In this document, the terms “a” or “an” are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of “at least one” or “one or more.” In this document, the term “or” is used to refer to a nonexclusive or, such that “A or B” includes “A but not B,” “B but not A,” and “A and B,” unless otherwise indicated. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.” Also, in the following claims, the terms “including” and “comprising” are open-ended, that is, a system, device, article, or process that includes elements in addition to those listed after such a term in a claim is still deemed to fall within the scope of that claim. Moreover, in the following claims, the terms “first,” “second,” and “third,” etc. are used merely as labels, and are not intended to impose numerical requirements on their objects.
The above description is intended to be illustrative, and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. Other embodiments may be used, such as by one of ordinary skill in the art upon reviewing the above description. The Abstract is to allow the reader to quickly ascertain the nature of the technical disclosure and is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. Also, in the above Detailed Description, various features may be grouped together to streamline the disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment. The scope of the embodiments should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (25)

  1. An apparatus for data re-arrangement, comprising: interface circuitry; and processor circuitry coupled to the interface circuitry and configured to:
    identify a re-arrangement operation on N dimension (N-D) data received via the interface circuitry as a combination of a two dimension (2D) data transpose operation or a data movement operation on the N-D data, where N is an integer greater than 1;
    perform, under a condition that the re-arrangement operation comprises the 2D data transpose operation, the 2D data transpose operation based on a combination of single instruction multiple threads (SIMT) memory coalescing load and single instruction multiple data (SIMD) memory coalescing store or a combination of SIMD memory coalescing load and SIMT memory coalescing store; and
    perform, under a condition that the re-arrangement operation comprises the data movement operation, the data movement operation based on direct memory coalescing load and store.
  2. The apparatus of claim 1, wherein the processor circuitry is configured to perform the 2D data transpose operation by:
    determining a 2D matrix corresponding to the N-D data based on an original dimension permutation and a target dimension permutation associated with the re-arrangement operation; and
    performing the 2D data transpose operation on the 2D matrix.
  3. The apparatus of claim 2, wherein performing the 2D data transpose operation on the 2D matrix comprises:
    splitting the 2D matrix into one or more basic blocks along a memory coalescing direction;
    performing a transpose of each basic block based on the combination of SIMT memory coalescing load and SIMD memory coalescing store or the combination of SIMD memory coalescing load and SIMT memory coalescing store; and
    switching transposed basic blocks present in a diagonal direction of the 2D matrix according to a transpose direction of the 2D transpose operation.
  4. The apparatus of claim 3, wherein a size of the basic block is determined based on a vector length supported by the processor circuitry.
  5. The apparatus of claim 4, wherein performing the transpose of each basic block comprises:
    loading all elements of the basic block with a plurality of threads from a global memory based on the SIMT memory coalescing load; and
    storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMD memory coalescing store,
    wherein a number of the threads is determined based on the vector length.
  6. The apparatus of claim 4, wherein performing the transpose of each basic block comprises:
    loading all elements of the basic block with a plurality of threads from a global memory based on the SIMD memory coalescing load; and
    storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMT memory coalescing store,
    wherein a number of the threads is determined based on the vector length.
  7. The apparatus of claim 1, wherein under a condition that a data coalescing dimension of the N-D data does not change in the re-arrangement operation, the combination of the 2D data transpose operation or the data movement operation comprises only the data movement operation.
  8. The apparatus of claim 1, wherein under a condition that N is equal to 2, the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation.
  9. The apparatus of claim 2, wherein under a condition that the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) and the target dimension permutation of the N-D data after the re-arrangement operation is (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation, where N is greater than 2 and I is an integer between 1 and N-1.
  10. The apparatus of claim 9, wherein the 2D matrix is a 2D matrix of (d0×…×dI-1) by (dI×…×dN-1) corresponding to the N-D data.
  11. The apparatus of claim 2, wherein under a condition that a data coalescing dimension of the N-D data changes in the re-arrangement operation and the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) but the target dimension permutation of the N-D data after the re-arrangement operation is not (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises both the 2D data transpose operation and the data movement operation, where N is greater than 2 and I is an integer between 1 and N-1.
  12. The apparatus of claim 11, wherein the 2D matrix is a 2D matrix of (d0×…×dM) by (dM+1×…×dN-1) corresponding to the N-D data, where dM is a data coalescing dimension in the target dimension permutation of the N-D data, and M is an integer between 0 and N-2.
  13. The apparatus of any of claims 1 to 12, wherein a data coalescing dimension in a dimension permutation (d0, d1, …, dN-1) of the N-D data is a last dimension dN-1 in the dimension permutation under a condition that the N-D data is stored in a row-major order and is a first dimension d0 in the dimension permutation under a condition that the N-D data is stored in a column-major order.
  14. A method for data re-arrangement, comprising:
    identifying a re-arrangement operation on N dimension (N-D) data as a combination of a two dimension (2D) data transpose operation or a data movement operation on the N-D data, where N is an integer greater than 1;
    performing, under a condition that the re-arrangement operation comprises the 2D data transpose operation, the 2D data transpose operation based on a combination of single instruction multiple threads (SIMT) memory coalescing load and single  instruction multiple data (SIMD) memory coalescing store or a combination of SIMD memory coalescing load and SIMT memory coalescing store; and
    performing, under a condition that the re-arrangement operation comprises the data movement operation, the data movement operation based on direct memory coalescing load and store.
  15. The method of claim 14, performing the 2D data transpose operation comprises:
    determining a 2D matrix corresponding to the N-D data based on an original dimension permutation and a target dimension permutation associated with the re-arrangement operation; and
    performing the 2D data transpose operation on the 2D matrix.
  16. The method of claim 15, wherein performing the 2D data transpose operation on the 2D matrix comprises:
    splitting the 2D matrix into one or more basic blocks along a memory coalescing direction;
    performing a transpose of each basic block based on the combination of SIMT memory coalescing load and SIMD memory coalescing store or the combination of SIMD memory coalescing load and SIMT memory coalescing store; and
    switching transposed basic blocks present in a diagonal direction of the 2D matrix according to a transpose direction of the 2D transpose operation.
  17. The method of claim 16, wherein a size of the basic block is determined based on a vector length supported by processor circuitry for implementing the method.
  18. The method of claim 17, wherein performing the transpose of each basic block comprises:
    loading all elements of the basic block with a plurality of threads from a global memory based on the SIMT memory coalescing load; and
    storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMD memory coalescing store,
    wherein a number of the threads is determined based on the vector length.
  19. The method of claim 17, wherein performing the transpose of each basic block comprises:
    loading all elements of the basic block with a plurality of threads from a global memory based on the SIMD memory coalescing load; and
    storing all the elements of the basic block with the plurality of threads into the global memory based on the SIMT memory coalescing store,
    wherein a number of the threads is determined based on the vector length.
  20. The method of claim 14, wherein under a condition that a data coalescing dimension of the N-D data does not change in the re-arrangement operation, the combination of the 2D data transpose operation or the data movement operation comprises only the data movement operation.
  21. The method of claim 14, wherein under a condition that N is equal to 2, the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation.
  22. The method of claim 15, wherein under a condition that the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) and the target dimension permutation of the N-D data after the re-arrangement operation is (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises only the 2D data transpose operation, and the 2D matrix is a 2D matrix of (d0×…×dI-1) by (dI×…×dN-1) corresponding to the N-D data, where N is greater than 2 and I is an integer between 1 and N-1.
  23. The method of claim 15, wherein under a condition that a data coalescing dimension of the N-D data changes in the re-arrangement operation and the original dimension permutation of the N-D data is (d0, …, dI-1, dI, …, dN-1) but the target dimension permutation of the N-D data after the re-arrangement operation is not (dI, …, dN-1, d0, …, dI-1), the combination of the 2D data transpose operation or the data movement operation comprises both the 2D data transpose operation and the data movement operation, and the 2D matrix is a 2D matrix of (d0×…×dM) by (dM+1×…×dN-1) corresponding to the N-D data, where dM is a data coalescing dimension in the target dimension permutation of the N-D data, N is greater than 2, I is an integer between 1 and N-1, and M is an integer between 0 and N-2.
  24. A computer-readable medium having instructions stored thereon, wherein the instructions, when executed by processor circuitry, cause the processor circuitry to perform the method of any of claims 14 to 23.
  25. A device for data re-arrangement, comprising means for performing the method of any of claims 14 to 23.