CN117271136A - Data processing method, device, equipment and storage medium - Google Patents

Data processing method, device, equipment and storage medium

Info

Publication number
CN117271136A
CN117271136A (application CN202311373162.5A)
Authority
CN
China
Prior art keywords
tensor
gpgpu
data
input tensor
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311373162.5A
Other languages
Chinese (zh)
Inventor
Name not to be published at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202311373162.5A priority Critical patent/CN117271136A/en
Publication of CN117271136A publication Critical patent/CN117271136A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of the present disclosure provide a data processing method, apparatus, device, and computer-readable storage medium. The method provided by the embodiments of the present disclosure divides an input tensor into a plurality of data blocks based on the minimum alignment granularity of a GPGPU and, based on the divided data blocks, converts the input tensor into a one-dimensional tensor, so that the computing task for the input tensor can be decomposed along the dimension of the one-dimensional tensor into a plurality of independent subtasks that can be executed in parallel on the hardware resources of the GPGPU. By performing task decomposition along this single dimension, the method improves the computational parallelism of the GPGPU, achieves higher hardware resource utilization, and improves overall computing performance and efficiency.

Description

Data processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing, and more particularly, to a data processing method, apparatus, device, and storage medium.
Background
In recent years, the computing power of graphics processing units (Graphics Processing Units, GPUs) has improved significantly, making general-purpose computing on GPUs increasingly popular. General-purpose computing on graphics processing units (GPGPU) is a technique for performing general-purpose computation using GPUs. The compute cores of a GPGPU can execute many threads simultaneously, enabling efficient parallel computation, which gives the GPGPU an advantage in processing large-scale data and compute-intensive tasks. GPGPUs are applied in a very wide range of fields, including scientific computing, data analysis, machine learning, deep learning, and cryptography. By exploiting the parallel computing capability of a GPGPU, the execution of such tasks can be accelerated and computing efficiency improved.
When a computing task is executed on a GPGPU, the parallel computing capability of the GPGPU can be exploited by decomposing the task into a plurality of independent subtasks and executing those subtasks on the GPGPU simultaneously, thereby improving overall computing performance and efficiency. The allocation of computing tasks to the hardware resources of a GPGPU therefore becomes a critical issue, as it concerns how to distribute computing tasks reasonably to different hardware units so as to achieve efficient parallel computing.
Therefore, a more efficient data processing method is needed to improve the computational parallelism of the GPGPU and thereby improve overall computing performance and efficiency.
Disclosure of Invention
To solve these problems, embodiments of the present disclosure fold the input tensor into a one-dimensional tensor based on the minimum alignment granularity of the GPGPU and perform task allocation of the hardware resources of the GPGPU on the one-dimensional tensor, thereby improving hardware resource utilization and increasing the computational parallelism of the GPGPU as well as the overall computing performance and efficiency.
Embodiments of the present disclosure provide a data processing method, apparatus, device, and computer readable storage medium.
An embodiment of the present disclosure provides a data processing method, including the following steps: acquiring an input tensor; dividing the input tensor into a plurality of data blocks based on a minimum alignment granularity of the GPGPU, and converting the input tensor into a one-dimensional tensor, wherein each element in the one-dimensional tensor corresponds to one data block, and the minimum alignment granularity indicates that the starting address of the storage address of each of the plurality of data blocks in the memory of the GPGPU is aligned to the minimum alignment granularity; determining, according to a computing task for the input tensor, a task allocation of the computing task on the hardware resources of the GPGPU based on the one-dimensional tensor; and executing the computing task for the input tensor based on the task allocation of the computing task on the hardware resources of the GPGPU.
Embodiments of the present disclosure provide a data processing apparatus including: a data acquisition module configured to acquire an input tensor; a tensor folding module configured to divide the input tensor into a plurality of data blocks based on a minimum alignment granularity of the GPGPU, and to convert the input tensor into a one-dimensional tensor, wherein each element in the one-dimensional tensor corresponds to a data block, the minimum alignment granularity being used to indicate that a starting address of a storage address of each of the plurality of data blocks in a memory of the GPGPU is aligned based on the minimum alignment granularity; a task decomposition module configured to determine a task allocation of a computing task on hardware resources of the GPGPU based on the one-dimensional tensor according to the computing task for the input tensor; and a task execution module configured to execute a computing task for the input tensor based on task allocation of the computing task on hardware resources of the GPGPU.
Embodiments of the present disclosure provide a data processing apparatus including: one or more processors; and one or more memories, wherein the one or more memories have stored therein a computer executable program which, when executed by the processor, performs the data processing method as described above.
Embodiments of the present disclosure provide a computer readable storage medium having stored thereon computer executable instructions which, when executed by a processor, are for implementing a data processing method as described above.
Embodiments of the present disclosure provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs a data processing method according to an embodiment of the present disclosure.
The method provided by the embodiments of the present disclosure divides an input tensor into a plurality of data blocks based on the minimum alignment granularity of the GPGPU and, based on the divided data blocks, converts the input tensor into a one-dimensional tensor, so that the computing task for the input tensor can be decomposed along the dimension of the one-dimensional tensor into a plurality of independent subtasks that can be executed in parallel on the hardware resources of the GPGPU. By performing task decomposition along this single dimension, the method improves the computational parallelism of the GPGPU, achieves higher hardware resource utilization, and improves overall computing performance and efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are used in the description of the embodiments will be briefly described below. It should be apparent that the drawings in the following description are only some exemplary embodiments of the present disclosure, and that other drawings may be obtained from these drawings by those of ordinary skill in the art without undue effort.
FIG. 1 is a schematic diagram illustrating task decomposition in three dimensions according to an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a data processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a data processing method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a first example of converting an input tensor into a one-dimensional tensor according to an embodiment of the present disclosure;
FIG. 5A is a schematic diagram illustrating a second example of converting an input tensor to a one-dimensional tensor according to an embodiment of the present disclosure;
FIG. 5B is a schematic diagram illustrating task allocation of hardware resources of a GPGPU based on a second example of converting an input tensor into a one-dimensional tensor, according to an embodiment of the disclosure;
FIG. 6A is a schematic diagram illustrating a third example of converting an input tensor to a one-dimensional tensor according to an embodiment of the present disclosure;
FIG. 6B is a schematic diagram illustrating task allocation of hardware resources of a GPGPU based on a third example of converting an input tensor into a one-dimensional tensor, according to an embodiment of the disclosure;
FIG. 7 is a schematic diagram illustrating a data processing apparatus according to an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure; and
fig. 9 shows a schematic diagram of an architecture of an exemplary computing device, according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
In the present specification and drawings, steps and elements that are substantially the same or similar are denoted by the same or similar reference numerals, and repeated descriptions of these steps and elements are omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance or order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the present disclosure only and is not intended to limit the present disclosure.
For purposes of describing the present disclosure, the following presents concepts related to the present disclosure.
The data processing method of the present disclosure may be based on GPGPU computing. GPGPU computing takes advantage of the parallel computing capability of GPGPUs to accelerate various computationally intensive tasks. The basic principle of GPGPU computing is to break a computing task up into multiple parallel subtasks, which are then distributed to the execution structures on the GPGPU for parallel execution. A GPGPU chip contains a dense array of execution structures, and these execution structures may be composed into larger execution structures according to certain granularities and rules; the hierarchy of execution structures on a GPGPU is described below. As an example, in a GPGPU, the execution structures for processing parallel computing tasks may include threads, Execution Units (EU), Compute Units (CU), stream processor cores (Streaming Processor Core, SPC), and the like. A thread, as the most basic execution structure, may be responsible for executing an instruction or a computing task; in a GPGPU, a large number of threads execute simultaneously to achieve a high degree of parallel computation. An execution unit (EU) may be composed of multiple threads and may execute the instructions of those threads, for example to perform a set of related computational tasks such as vector operations or scalar operations. A CU, as an execution structure one level above the EU, may be composed of a plurality of EUs and may perform multiple parallel computing tasks simultaneously, thereby increasing overall computing throughput. An SPC, as a basic computational unit of a GPGPU chip, may be responsible for performing parallel computing tasks and typically consists of multiple CUs. By adopting an appropriate organization of these execution structures, greater parallel computing power and flexibility can be provided.
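As a rough illustration of the hierarchy described above, the following sketch models how the parallel capacity of such a chip scales with the number of SPCs, CUs, EUs, and threads. All names and counts here are hypothetical examples for illustration only, not parameters of any particular GPGPU.

```python
from dataclasses import dataclass

@dataclass
class ExecutionHierarchy:
    num_spc: int          # stream processor cores (SPC) on the chip
    cus_per_spc: int      # compute units (CU) per SPC
    eus_per_cu: int       # execution units (EU) per CU
    threads_per_eu: int   # threads per EU (warp size)

    def total_parallel_threads(self) -> int:
        # Upper bound on the number of threads that can be resident at once.
        return self.num_spc * self.cus_per_spc * self.eus_per_cu * self.threads_per_eu

# Hypothetical counts: 6 SPCs, 4 CUs per SPC, 4 EUs per CU, 32 threads per EU.
hier = ExecutionHierarchy(num_spc=6, cus_per_spc=4, eus_per_cu=4, threads_per_eu=32)
print(hier.total_parallel_threads())  # 3072
```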
In summary, the solution provided by the embodiments of the present disclosure relates to GPGPU computing, and the embodiments of the present disclosure will be further described below with reference to the accompanying drawings.
As a highly parallel processor, the GPGPU has a large number of compute cores and high memory bandwidth, and is well suited to executing parallel computing tasks. Task decomposition on a GPGPU is one of the key steps in achieving parallel computing.
In some existing methods of task decomposition on a GPGPU, when tasks are assigned to the execution structures, the execution structures are typically divided into multiple SPCs, EUs, and threads to process, in parallel, the tensors that need to be processed. Fig. 1 is a schematic diagram illustrating task decomposition in three dimensions according to an embodiment of the present disclosure. As shown in fig. 1, assume that the tensor to be processed has three dimensions, N, H and W, where N represents the batch size, H represents the height, and W represents the width. In order to efficiently utilize the parallel computing power of the GPGPU, these methods map the execution structures SPC, EU, and thread to these three dimensions, respectively, for task decomposition.
For example, the task decomposition may be performed by mapping various execution structures onto respective dimensions of the input tensor, respectively. For example, for a three-dimensional input tensor, different execution structures SPC, EU, and threads may be mapped onto N, H and W dimensions, respectively, to decompose a computing task into multiple subtasks (i.e., generate multiple subtask instructions) for execution by the different execution structures, thereby enabling parallel computing.
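To make the per-dimension mapping above concrete, the following sketch (an illustrative assumption, not a scheme mandated by any hardware) lets each (SPC, EU, thread) triple stride over the N, H, and W dimensions respectively; it also makes the imbalance discussed below easy to see, since an SPC whose index exceeds N-1 receives no work at all.

```python
def subtasks_of(spc, eu, thread, N, H, W, num_spc, eus_per_spc, threads_per_eu):
    """List the (n, h, w) elements one (SPC, EU, thread) triple would process
    when SPCs stride over N, EUs over H, and threads over W."""
    return [(n, h, w)
            for n in range(spc, N, num_spc)
            for h in range(eu, H, eus_per_spc)
            for w in range(thread, W, threads_per_eu)]

# With N=2 but 6 SPCs, SPCs 2..5 receive no work at all, illustrating the
# task-decomposition imbalance discussed below.
print(len(subtasks_of(0, 0, 0, N=2, H=26, W=187,
                      num_spc=6, eus_per_spc=4, threads_per_eu=32)))  # 42
print(len(subtasks_of(2, 0, 0, N=2, H=26, W=187,
                      num_spc=6, eus_per_spc=4, threads_per_eu=32)))  # 0
```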
However, in the above task decomposition method, there are the following two main problems:
1. Task decomposition imbalance: because the amount of work along different dimensions of the input tensor may differ, the task decomposition may become unbalanced, i.e., some execution structures process more or fewer tasks than others. As a result, some execution structures sit idle while others are overloaded, which wastes hardware resources and can also reduce overall computing efficiency.
2. Low video memory access efficiency: because the data access patterns along different dimensions differ, mapping different execution structures to different dimensions of a three-dimensional tensor, for example, causes the video memory addresses accessed during computation to change frequently and by large amounts, so that video memory access efficiency and the cache hit rate are low and computing efficiency is reduced. In addition, with this task decomposition method it is difficult to fully utilize the video memory bandwidth, which limits the computation speed.
To solve these problems, the task decomposition method needs to be optimized to improve computing efficiency and memory access efficiency. In embodiments of the present disclosure, task decomposition may be performed according to the specific situation; for example, an appropriate decomposition strategy may be determined according to the characteristics of the hardware resources and of the tasks, so as to ensure balanced task allocation and make maximum use of the video memory bandwidth. This requires comprehensively considering factors such as the size of the input tensor, the hardware architecture, and the parallelism of the tasks.
On this basis, the present disclosure provides a data processing method in which the input tensor is folded into a one-dimensional tensor based on the minimum alignment granularity of the GPGPU and task allocation of the hardware resources of the GPGPU is performed on the one-dimensional tensor, thereby improving hardware resource utilization and increasing the computational parallelism of the GPGPU as well as the overall computing performance and efficiency.
The method provided by the embodiments of the present disclosure divides an input tensor into a plurality of data blocks based on the minimum alignment granularity of the GPGPU and, based on the divided data blocks, converts the input tensor into a one-dimensional tensor, so that the computing task for the input tensor can be decomposed along the dimension of the one-dimensional tensor into a plurality of independent subtasks that can be executed in parallel on the hardware resources of the GPGPU. By performing task decomposition along this single dimension, the method improves the computational parallelism of the GPGPU, achieves higher hardware resource utilization, and improves overall computing performance and efficiency.
Fig. 2 is a flowchart illustrating a data processing method 200 according to an embodiment of the present disclosure. Fig. 3 is a schematic diagram illustrating a data processing method according to an embodiment of the present disclosure.
Alternatively, the data processing method of the present disclosure may be applied to the development of operators on a GPGPU, where an operator may be an operator or function for performing a particular computational task, including, but not limited to, mathematical operations, logical operations, bit operations, image processing, data compression, sorting, and the like. The computational tasks of these operators typically require a large amount of parallel computation and data processing, and the parallel computing capability and high-bandwidth memory of the GPGPU can significantly improve the execution efficiency and performance of such operators. Thus, computing operators in parallel on a GPGPU can speed up the computation process and improve overall computing performance. In the scenario of operator development on a GPGPU, tasks need to be allocated reasonably to the execution units on the GPGPU to improve the parallelism of the operators. That is, in embodiments of the present disclosure, the computing task of an operator may be decomposed into multiple independent sub-computing tasks that are executed on the GPGPU at the same time, so as to fully utilize the parallel computing capability of the GPGPU and improve overall computing performance and efficiency. Of course, it should be understood that the above scenario is merely exemplary and not limiting, and the data processing method of the present disclosure is equally applicable to various other scenarios.
In step S201, an input tensor may be acquired.
Alternatively, tensors may be obtained as input to the data processing methods of the present disclosure, i.e., as input to a computing task. The input tensor may refer to multi-dimensional array data that needs to be computed on the GPGPU, which may be various types of data, such as images, audio, text, or any other type of data. The dimensions, shape, and data type of the input tensor may depend on the particular application scenario and task requirements. For example, for image processing tasks, the input tensor may typically have three or four dimensions, representing the height, width, number of channels, and batch size of the image, respectively. For audio processing tasks, the input tensor may typically have two or three dimensions, representing the number of sample points, the number of channels, and the batch size of the audio, respectively. For text processing tasks, the input tensor may typically have two dimensions, representing the length of the text and the batch size, respectively.
Alternatively, the data of the input tensor may be loaded and transferred onto the GPGPU in a variety of ways. For example, common approaches may include copying data from host memory to GPGPU memory, or distributing and initializing data directly in GPGPU memory. After the calculations are performed, the results may be copied from the GPGPU memory back to host memory for further processing and analysis.
Specifically, as an example, the input tensor that needs to be processed may be loaded from the host memory into the GPGPU memory, and allocated to the execution unit for calculation. Thus, the execution unit can directly read the input data from the GPGPU memory and calculate the input tensor in parallel. For example, in the case where there are multiple execution units, each execution unit may be responsible for processing different portions of the data of the input tensor to fully utilize the parallel computing capabilities of the GPGPU, increasing the computing speed. Thus, after the calculation is completed in this way, the execution unit may write the calculation result into an output tensor, which is the output of the calculation task. Therefore, by loading the input tensor into the GPGPU memory and distributing the input tensor to the execution unit for calculation, the data transmission and processing overhead can be reduced to the maximum extent, and the calculation efficiency is improved, which is an important step for performing high-performance calculation on the GPGPU.
In step S202, the input tensor may be divided into a plurality of data blocks based on a minimum alignment granularity of the GPGPU, and the input tensor may be converted into a one-dimensional tensor, where each element in the one-dimensional tensor corresponds to one data block, and the minimum alignment granularity indicates that the starting address of the storage address of each of the plurality of data blocks in the memory of the GPGPU is aligned to the minimum alignment granularity.
Optionally, in an embodiment of the present disclosure, the dimensions of the input tensor may be folded according to the minimum alignment granularity defined by the hardware layout of the GPGPU, so as to convert the input tensor into a one-dimensional tensor. Here, the minimum alignment granularity defined by the hardware layout of the GPGPU means that, when data is accessed in the memory of the GPGPU, the data needs to be stored with a certain byte alignment so as to improve memory access efficiency. Specifically, the multidimensional input tensor can be reorganized into a one-dimensional array and memory-aligned according to the hardware layout requirement, where the hardware layout defines how the dimensions of the input tensor are mapped to the memory of the GPGPU.
Therefore, the input tensor can be converted into a one-dimensional tensor according to the manner in which the data of the input tensor is stored in the memory of the GPGPU. In embodiments of the present disclosure, the input tensor may be divided into a plurality of data blocks according to the minimum alignment granularity defined by the hardware layout of the GPGPU, where the individual data blocks are aligned with a certain byte alignment. Alignment here means that the addresses at which data are stored in memory follow a certain rule; for example, when the minimum alignment granularity is 4 bytes, the data blocks are aligned and stored on 4-byte boundaries, that is, the starting address of each data block must be a multiple of 4.
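A minimal sketch of this alignment rule, assuming the 4-byte granularity used in the example above (the granularity values are illustrative, not a property of any specific hardware):

```python
def align_up(addr: int, granularity: int = 4) -> int:
    """Round a starting address up to the next multiple of the alignment granularity."""
    return (addr + granularity - 1) // granularity * granularity

print(align_up(13))      # 16: a data block starting at 16 satisfies 4-byte alignment
print(align_up(64, 64))  # 64: already aligned to a 64-byte boundary
```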
As an example, in an embodiment of the present disclosure, an input tensor may be converted into a one-dimensional tensor by performing data block partitioning 301 and tensor folding 302 on the input tensor, as shown in fig. 3.
For example, as shown in FIG. 3, consider a three-dimensional input tensor whose three dimensions are N, H and W, where N represents the batch size, H represents the height, and W represents the width. For ease of description, data block partitioning, tensor folding, and the subsequent task allocation are described below using the H and W dimensions as an example, but it should be understood that these processes can be applied similarly to other dimensions (including, but not limited to, the N dimension or other additional dimensions).
Alternatively, different data blocks divided from the input tensor may be stored in different memory areas, and the data blocks may be stored with a certain byte alignment. The minimum alignment granularity of the GPGPU may indicate the manner in which the individual data blocks are aligned, and may specify, for example, the number of consecutively aligned data blocks and the amount of data that each data block contains. For example, each data block may include 4×32 bits of data (i.e., 16 bytes of data), i.e., the data corresponding to 4 rows in the H dimension and 32 columns in the W dimension of the input tensor, and the minimum alignment granularity defined by the hardware layout of the GPGPU may require alignment on every 4 data blocks (e.g., every 64 bytes). In this case, the actual data in the H and W dimensions (shown as the shaded rectangle) may be divided into multiple data blocks and aligned such that every 4 data blocks (each data block including 4×32 bits of data) are aligned. As shown in fig. 3, the actual data of the input tensor spans less than 7 data blocks in the H dimension and less than 6 data blocks in the W dimension; therefore, according to the minimum alignment granularity of the GPGPU (for example, alignment on every 4 data blocks of 4×32 bits), the data block division of the input tensor may yield 8 data blocks in the H dimension (the portion beyond the actual data being filled with a predefined filling value, for example zero or another specific value) and 6 data blocks in the W dimension (the portion beyond the actual data likewise being filled with a predefined filling value).
Thus, as described above, the input tensor may be divided into multiple data blocks based on the minimum alignment granularity of the GPGPU.
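As a sketch of the block partitioning just described (matching the Fig. 3 example), the following assumes, purely for illustration, blocks of 4 rows by 32 columns and an alignment requirement of 4 consecutive blocks applied to the H dimension; the block shape and which dimension receives the alignment padding are assumptions, not a hardware specification.

```python
import math

def partition_into_blocks(H, W, block_rows=4, block_cols=32, blocks_per_align=4):
    """Count the data blocks per dimension after padding up to whole, aligned blocks."""
    h_blocks = math.ceil(H / block_rows)
    w_blocks = math.ceil(W / block_cols)
    # Pad the H-dimension block count up to the alignment requirement (every 4 blocks);
    # the region beyond the actual data is assumed to be filled with a padding value.
    h_blocks = math.ceil(h_blocks / blocks_per_align) * blocks_per_align
    return h_blocks, w_blocks

# Actual data spanning just under 7 blocks in H and just under 6 blocks in W,
# as in the Fig. 3 discussion, yields an 8 x 6 grid of data blocks.
print(partition_into_blocks(H=26, W=187))  # (8, 6)
```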
According to an embodiment of the present disclosure, for a plurality of memory regions in a memory of the GPGPU for storing the input tensor, data in a data block corresponding to one element in the one-dimensional tensor may be stored in one of the plurality of memory regions. Alternatively, all data in one data block divided from the input tensor may be stored in the same memory region to make the access of the GPGPU to the memory more continuous.
According to an embodiment of the present disclosure, for different elements in the one-dimensional tensor, different data blocks corresponding to the different elements are stored in one or more of the plurality of memory regions based on their locations in the input tensor and a predetermined storage layout indicating a manner in which data blocks in different dimensions of the input tensor are arranged in memory of the GPGPU.
Alternatively, different data blocks divided from the input tensor may be stored in different memory areas, and the manner in which these data blocks are stored may be based on their positions in the input tensor and a predetermined storage layout. For example, the predetermined storage layout may indicate the order in which the data blocks of the respective dimensions of the input tensor are stored in the memory of the GPGPU, e.g., the data blocks of the input tensor are stored layer by layer with the H dimension as the innermost layer, the W dimension as the next outer layer, and the N dimension as the outermost layer. As shown in fig. 3, taking the H and W dimensions as an example, the data blocks may be stored in the manner shown by the dashed arrows in the rectangular frame, where the Q memory areas represent the memory areas currently available for the input tensor (Q being no greater than the total number of currently available memory areas), the P data blocks are P data blocks that are consecutive along the dashed arrows, and the data blocks are stored in turn, in the order indicated by the dashed arrows, into the corresponding ones of the Q memory areas. Alternatively, data blocks that are consecutive according to their positions in the input tensor and the predetermined storage layout may be written to the memory in a cyclic manner. For example, when 12 consecutive data blocks are to be stored in 6 memory areas of the GPGPU, the first 6 data blocks may first be written into the 6 memory areas in turn, and then the last 6 data blocks may be written into the same 6 memory areas in turn.
By the storage mode, data in the input tensor can be uniformly stored in each memory area of the GPGPU. Of course, it should be understood that the above data storage manner is used in this disclosure by way of example only and not limitation, and that other data storage manners may be employed by the methods of the present disclosure that achieve the same results.
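A minimal sketch of this cyclic (round-robin) placement; the specific mapping from block index to memory region is an illustrative assumption, and any scheme achieving the same even spread would do.

```python
def assign_blocks_to_regions(num_blocks: int, num_regions: int):
    """Return, for each data block index, the memory region it is written to."""
    return [block % num_regions for block in range(num_blocks)]

# 12 consecutive data blocks over 6 regions: blocks 0-5 fill regions 0-5 in turn,
# then blocks 6-11 wrap around and fill regions 0-5 again.
print(assign_blocks_to_regions(12, 6))  # [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
```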
According to an embodiment of the present disclosure, converting the input tensor into a one-dimensional tensor may include: the plurality of data blocks are arranged in the predetermined storage layout based on the position of each of the plurality of data blocks in the input tensor to convert the input tensor into the one-dimensional tensor.
After the data block division 301 of the input tensor is completed as described above, a tensor folding 302 may be performed based on the determined plurality of data blocks. Alternatively, the data blocks may be spliced according to the storage order of the data blocks indicated by the above-described predetermined storage layout to generate a one-dimensional tensor corresponding to the input tensor. Alternatively, the computational tasks of the input tensor in the present disclosure may be computation tasks that are insensitive to memory order, i.e., do not depend on the relative positions or order of the elements in the input tensor during computation. That is, the data processing method in the present disclosure can be applied to a data processing scene in which the order of access of elements of an input tensor is not required, and thus the data blocks of the input tensor can be stored and accessed in a desired order.
As shown in fig. 3, the input tensor is divided into h_tile (e.g., 8) and w_tile (e.g., 6) parts in the H and W dimensions respectively, i.e., the input tensor is divided into h_tile × w_tile data blocks, for example 8 × 6 data blocks b1-b48. These data blocks can be spliced, based on the predetermined storage layout, into the one-dimensional tensor shown on the left of fig. 3, where data blocks b1-b8 correspond to the first 8 elements of the one-dimensional tensor, data blocks b9-b16 correspond to the next 8 elements, and so on.
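The folding order can be sketched as follows, with the H dimension as the innermost layer so that the first h_tile elements of the one-dimensional tensor correspond to the first column of blocks (b1-b8 in Fig. 3); the (w, h) block labels below are illustrative only.

```python
def fold_blocks(h_tile: int, w_tile: int):
    """Splice the h_tile x w_tile grid of data blocks into a one-dimensional tensor,
    walking the H dimension (innermost layer) first, then the W dimension."""
    one_d = []
    for w in range(w_tile):        # next outer layer: W dimension
        for h in range(h_tile):    # innermost layer: H dimension
            one_d.append((w, h))   # element len(one_d)-1 <-> data block at column w, row h
    return one_d

folded = fold_blocks(h_tile=8, w_tile=6)
print(len(folded))     # 48 elements, matching data blocks b1-b48
print(folded[:8])      # the first 8 elements are the blocks of column w=0, rows h=0..7
```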
Based on the one-dimensional tensors obtained through the above steps, as shown in fig. 3, task allocation 303 for the calculation task of the input tensor can be performed.
In step S203, a task allocation of the computing task on the hardware resources of the GPGPU may be determined based on the one-dimensional tensor according to the computing task for the input tensor.
Alternatively, for task allocation 303, the following aspects may be considered. (1) Data parallelism: that is, whether the computing task for the input tensor can be decomposed into a plurality of independent sub-computing tasks for parallel computation; if there are no dependencies between the sub-computing tasks and different data blocks can be processed in parallel, parallelism can be improved through data parallelism. (2) Task allocation: that is, computing tasks are allocated reasonably to different execution units, and the allocation strategy may be determined according to factors such as task complexity, data dependencies, and computational load; for example, compute-intensive tasks may be allocated to CUs and data-intensive tasks to EUs. (3) Data transfer: in distributed computing, data may need to be transferred from host memory to GPGPU memory, or between GPGPU memories; data transfer therefore needs to be managed reasonably to reduce its overhead and thus improve the execution efficiency of the computing task.
In embodiments of the present disclosure, for each element in the one-dimensional tensor described above, i.e., each data block, a corresponding SPC among the hardware resources of the GPGPU may be used to execute the sub-computing task for that data block. Alternatively, given a first number of SPCs available for executing the computing task for the input tensor, each SPC may be used to process the sub-computing tasks of a plurality of data blocks, and the plurality of SPCs may be used to concurrently process the sub-computing tasks of successive groups of the first number of consecutive data blocks of the one-dimensional tensor, according to the storage order of the data blocks indicated by the predetermined storage layout described above. For example, as shown in FIG. 3, assume that there are 6 SPCs available for executing the computing task for the input tensor: the 6 SPCs first process a corresponding group of 6 consecutive data blocks of the one-dimensional tensor (e.g., data blocks b1-b6), then process the next group of 6 consecutive data blocks (e.g., data blocks b7-b12), and so on, until the sub-computing tasks of all data blocks of the one-dimensional tensor have been assigned to the SPCs for processing. By performing task decomposition in one dimension in this way, a more uniform task allocation can be achieved, making more uniform use of the hardware resources of the GPGPU.
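Under the assumptions of the Fig. 3 example (one element per SPC per round), the allocation can be sketched as element i of the one-dimensional tensor going to SPC (i mod number-of-SPCs); the scheduling function below is an illustrative assumption, not the only possible mapping.

```python
def allocate_to_spcs(num_elements: int, num_spc: int):
    """Map element i of the one-dimensional tensor to SPC (i mod num_spc)."""
    schedule = {}   # SPC id -> ordered list of element indices it processes
    for i in range(num_elements):
        schedule.setdefault(i % num_spc, []).append(i)
    return schedule

# 48 elements on 6 SPCs: in the first round the SPCs take elements 0-5 (blocks b1-b6),
# in the next round elements 6-11 (blocks b7-b12), and so on.
sched = allocate_to_spcs(48, 6)
print(sched[0][:4])  # [0, 6, 12, 18]
```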
Alternatively, in the data processing method of the present disclosure, based on the actual data processing capability of the hardware resources of the GPGPU, various ways of converting the input tensor into the one-dimensional tensor may be employed, and thus a task allocation method corresponding to these ways is employed. The following will be presented by way of example in three ways of converting an input tensor into a one-dimensional tensor.
Fig. 4 is a schematic diagram illustrating a first example of converting an input tensor into a one-dimensional tensor according to an embodiment of the present disclosure. In fig. 4, a first example of tensor folding of a three-dimensional input tensor is shown, i.e., the case where each thread of the GPGPU reads and processes one data element of the input tensor at a time.
According to an embodiment of the disclosure, according to a computing task for the input tensor, determining a task allocation of the computing task on a hardware resource of the GPGPU based on the one-dimensional tensor may include: decomposing the computing task for the input tensor into sub-computing tasks performed on each of a plurality of execution units of the GPGPU based on a position in the input tensor of a data block corresponding to each element in the one-dimensional tensor and the predetermined storage layout, wherein the sub-computing tasks performed on each execution unit may be used to process one element in the one-dimensional tensor; each execution unit may include a plurality of threads, where each thread may be configured to process one data in one data block corresponding to one element in the one-dimensional tensor.
Optionally, when several dimensions of the input tensor are folded into one dimension, the dimension along which the memory address varies least with the coordinates may be placed at the innermost layer, so as to improve memory access efficiency. A smaller variation of the memory address means a smaller difference between the addresses of adjacent data elements in memory, i.e., the data are stored more contiguously; conversely, a larger variation of the memory address means a larger difference between the addresses of adjacent data elements, i.e., the data are stored more discretely. By placing the dimension along which the memory address varies least with the coordinates at the innermost layer of the tensor folding and task allocation (for example, in the case shown in fig. 3, the H dimension of the input tensor is placed at the innermost layer), accesses to the data in memory become more contiguous. Therefore, when an execution unit accesses data, better data contiguity can be achieved, which reduces the latency of memory access operations and improves the utilization of memory access bandwidth.
Alternatively, in the case where each thread of the GPGPU reads and processes only one data element of the input tensor at a time, tensor folding may be performed in units of one data element, with the dimension along which the memory address varies least with the coordinates taken as the innermost layer. For example, as shown in fig. 4, when the predetermined storage layout described above is used, tensor folding is performed in units of one row of the H dimension, with the H dimension of the input tensor as the innermost layer.
As shown in fig. 4, for an input tensor with dimensions N×H×W, the dimension along which the memory address varies least with the coordinates may be taken as the innermost layer; for example, when data are stored according to the predetermined storage layout described above, task allocation may be performed in units of one row of the H dimension, with the H dimension as the innermost layer. For example, one row of the H dimension and several columns of the W dimension form one data block (corresponding to the smallest rectangular block in fig. 4), where the number of columns of the W dimension may correspond to the number of threads (warp_size) included in one execution unit, and such a data block may correspond to one execution unit. As shown in fig. 4, one execution unit may include warp_size threads; therefore, for the W dimension, one row of the H dimension may correspond to w_tile execution units, where w_tile corresponds to the value of W/warp_size rounded up, i.e., (W + warp_size - 1)/warp_size, that is, the number of thread bundles (execution units) corresponding to one unit of the H dimension. Therefore, when data block division, tensor folding, and task allocation are performed in units of one row of the H dimension and one batch of the N dimension, the H dimension is divided into h_tile = H units and the N dimension is divided into n_tile = N units.
Thus, for the case described above with reference to fig. 4, the one-dimensional tensor corresponding to the input tensor may include n_tile × h_tile × w_tile data blocks, and these data blocks may be processed by the corresponding n_tile × h_tile × w_tile execution units.
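The tile counts of this first example can be sketched as follows; warp_size = 32 is an assumed value used only for illustration.

```python
def first_example_tiles(N, H, W, warp_size=32):
    """Tile counts when each thread handles one data element per step."""
    w_tile = (W + warp_size - 1) // warp_size   # warps needed to cover the W dimension
    h_tile = H                                  # one row of H per element
    n_tile = N                                  # one batch of N per element
    return n_tile, h_tile, w_tile               # n_tile * h_tile * w_tile execution units

print(first_example_tiles(N=2, H=26, W=187))    # (2, 26, 6)
```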
Furthermore, considering the burst transfer characteristics of the GPGPU, i.e., the ability of the GPGPU to achieve efficient data transfer and computation by exploiting hardware optimizations such as data parallelism, instruction-level parallelism, caches, and memory bandwidth, embodiments of the present disclosure may also consider the case where each thread of the GPGPU reads and processes two data elements at a time, which corresponds to the second example below.
According to an embodiment of the disclosure, according to a computing task for the input tensor, determining a task allocation of the computing task on a hardware resource of the GPGPU based on the one-dimensional tensor may include: decomposing the computing task for the input tensor into sub-computing tasks performed on each of a plurality of execution units of the GPGPU based on a location in the input tensor of a data block corresponding to each element in the one-dimensional tensor and the predetermined storage layout, wherein the sub-computing tasks performed on each execution unit may be used to process one element in the one-dimensional tensor, which may correspond to two or more data blocks of the input tensor that are consecutive based on the predetermined storage layout; wherein each execution unit may comprise a plurality of threads, wherein each thread may be configured to process two data from two adjacent ones of the two or more data blocks corresponding to an element of the one-dimensional tensor.
Fig. 5A is a schematic diagram illustrating a second example of converting an input tensor into a one-dimensional tensor according to an embodiment of the present disclosure. Fig. 5B is a schematic diagram illustrating task allocation of hardware resources of a GPGPU based on a second example of converting an input tensor into a one-dimensional tensor, according to an embodiment of the present disclosure.
Alternatively, in the case where each thread of the GPGPU can read and process two data elements of the input tensor at a time, tensor folding may be performed in units of two data elements, with the dimension along which the memory address varies least with the coordinates taken as the innermost layer. For example, as shown in fig. 5A, when the predetermined storage layout described above is used, tensor folding is performed in units of two rows of the H dimension, with the H dimension of the input tensor as the innermost layer.
As shown in fig. 5A and 5B, for an input tensor with dimensions N×H×W, the dimension along which the memory address varies least with the coordinates may be taken as the innermost layer; for example, when data are stored according to the predetermined storage layout described above, task allocation may be performed in units of two rows of the H dimension, with the H dimension as the innermost layer. For example, one row of the H dimension and several columns of the W dimension form one data block (corresponding to the smallest rectangular block in fig. 5A), and two rows of the H dimension and several columns of the W dimension form one element of the one-dimensional tensor (corresponding to two consecutive smallest rectangular blocks in fig. 5A), where the number of columns of the W dimension may correspond to the number of threads (warp_size) included in one execution unit, and one element of the one-dimensional tensor may correspond to one execution unit (i.e., two consecutive smallest rectangular blocks in fig. 5A). As shown in fig. 5A, one execution unit may include warp_size threads; therefore, for the W dimension, two rows of the H dimension may correspond to w_tile execution units, where w_tile corresponds to the value of W/warp_size rounded up, i.e., (W + warp_size - 1)/warp_size, that is, the number of thread bundles (execution units) corresponding to one unit of the H dimension. Thus, when tensor folding and task allocation are performed in units of two rows of the H dimension and one batch of the N dimension, the H dimension is divided into h_tile parts, where h_tile corresponds to the value of H/2 rounded up, and the N dimension is divided into n_tile = N parts.
For example, as shown in FIG. 5B, assume that there are 6 SPCs available for executing the computing task for the input tensor: the 6 SPCs process a corresponding group of 6 elements of the one-dimensional tensor (i.e., 6×2 consecutive data blocks, e.g., data blocks b1-b12), then process the next group of 6 elements (i.e., 6×2 consecutive data blocks, e.g., data blocks b13-b24, not fully shown in FIG. 5B), and so on, until the sub-computing tasks of all data blocks of the one-dimensional tensor have been assigned to the SPCs for processing.
Thus, for the case described above with reference to fig. 5A, the one-dimensional tensor corresponding to the input tensor may include n_tile × h_tile × w_tile data blocks, and these data blocks may be processed by the corresponding n_tile × h_tile × w_tile execution units. With the tensor folding and task allocation described above with reference to the second example, the hardware resources of the GPGPU can be utilized more fully, improving computing efficiency.
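A sketch of the tile counts for this second example; relative to the first example, only h_tile changes (two rows of the H dimension per element), and warp_size = 32 remains an assumed value.

```python
def second_example_tiles(N, H, W, warp_size=32):
    """Tile counts when each thread reads and processes two data elements,
    so two rows of the H dimension form one element of the one-dimensional tensor."""
    w_tile = (W + warp_size - 1) // warp_size
    h_tile = (H + 1) // 2                        # ceil(H / 2)
    n_tile = N
    return n_tile, h_tile, w_tile

print(second_example_tiles(N=2, H=26, W=187))    # (2, 13, 6)
```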
Further, the memory access efficiency of the method of the present disclosure may be further improved on the basis of the second example described above. Fig. 6A is a schematic diagram illustrating a third example of converting an input tensor into a one-dimensional tensor according to an embodiment of the present disclosure. Fig. 6B is a schematic diagram illustrating task allocation of hardware resources of a GPGPU based on a third example of converting an input tensor to a one-dimensional tensor, according to an embodiment of the present disclosure.
Alternatively, similarly to the second example, in the case where each thread of the GPGPU can read and process two data elements of the input tensor at a time, tensor folding may instead be performed in units of four data elements, with the dimension along which the memory address varies least with the coordinates taken as the innermost layer. For example, as shown in fig. 6A, when the predetermined storage layout described above is used, tensor folding is performed in units of four rows of the H dimension, with the H dimension of the input tensor as the innermost layer.
As shown in fig. 6A and 6B, for an input tensor with dimensions N×H×W, the dimension along which the memory address varies least with the coordinates may be taken as the innermost layer; for example, when data are stored according to the predetermined storage layout described above, task allocation may be performed in units of four rows of the H dimension, with the H dimension as the innermost layer. For example, one row of the H dimension and several columns of the W dimension form one data block (corresponding to the smallest rectangular block in fig. 6A), and four rows of the H dimension and several columns of the W dimension form one element of the one-dimensional tensor (corresponding to four consecutive smallest rectangular blocks in fig. 6A), where the number of columns of the W dimension may correspond to the number of threads (warp_size) included in one execution unit, and one element of the one-dimensional tensor may correspond to one execution unit (i.e., four consecutive smallest rectangular blocks in fig. 6A). Further, when processed by one execution unit, one element of the one-dimensional tensor may be divided into two groups (as shown in fig. 6A by the four consecutive smallest rectangular blocks filled with different colors) that are processed in sequence (for example, by two execution instructions). As shown in fig. 6A, one execution unit may include warp_size threads; therefore, for the W dimension, four rows of the H dimension may correspond to w_tile execution units, where w_tile corresponds to the value of W/warp_size rounded up, i.e., (W + warp_size - 1)/warp_size, that is, the number of thread bundles (execution units) corresponding to one unit of the H dimension. Thus, when tensor folding and task allocation are performed in units of four rows of the H dimension and one batch of the N dimension, the H dimension is divided into h_tile parts, where h_tile corresponds to the value of H/4 rounded up, and the N dimension is divided into n_tile = N parts.
For example, as shown in FIG. 6B, assume that there are 6 SPCs available for executing the computing task for the input tensor: the 6 SPCs process a corresponding group of 6 elements of the one-dimensional tensor (i.e., 6×4 consecutive data blocks, e.g., data blocks b1-b16), then process the next group of 6 elements (i.e., 6×4 consecutive data blocks, e.g., data blocks b17-b32, not shown in FIG. 6B), and so on, until the sub-computing tasks of all data blocks of the one-dimensional tensor have been assigned to the SPCs for processing.
Thus, for the case described above with reference to fig. 6A, the one-dimensional tensor corresponding to the input tensor may include n_tile × h_tile × w_tile data blocks, and these data blocks may be processed by the corresponding n_tile × h_tile × w_tile execution units. With the tensor folding and task allocation described above with reference to the third example, the coordinates of each data block can be reused when reading data, so that the data can be accessed in memory more efficiently; this reduces data access latency and maximizes data reuse, improving computing efficiency.
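A sketch of the tile counts for this third example, in which four rows of the H dimension form one element that an execution unit processes in two successive groups; warp_size = 32 is again an assumed value.

```python
def third_example_tiles(N, H, W, warp_size=32):
    """Tile counts when four rows of the H dimension form one element, which an
    execution unit processes as two successive groups of two rows each."""
    w_tile = (W + warp_size - 1) // warp_size
    h_tile = (H + 3) // 4                        # ceil(H / 4)
    n_tile = N
    return n_tile, h_tile, w_tile

print(third_example_tiles(N=2, H=26, W=187))     # (2, 7, 6)
```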
After completing the task allocation of the computing task for the input tensor, in step S204, the computing task for the input tensor may be performed based on the task allocation of the computing task on the hardware resources of the GPGPU.
According to an embodiment of the disclosure, based on task allocation of the computing task on hardware resources of the GPGPU, performing the computing task for the input tensor may include: the computing tasks for the input tensor are performed by executing respective sub-computing tasks on a plurality of execution units of the GPGPU.
Alternatively, as described above, for computing tasks that are insensitive to memory access order, the data processing method of the present disclosure may decompose the computing task for the input tensor into a plurality of independent sub-computing tasks in a single dimension and allocate these sub-computing tasks reasonably to different execution units along the dimension of the one-dimensional tensor. By folding the dimensions of the input tensor into one dimension according to the minimum alignment granularity of the hardware and accessing the data accordingly, parallelism can be improved through data parallelism, the hardware alignment requirement can be used to improve memory access efficiency and reduce memory access latency, the parallel computing capability of the GPGPU can be better utilized, and the execution of operators can be accelerated. In addition, by performing task allocation in one dimension, a more uniform task allocation can be achieved and the utilization of the memory bandwidth can be improved, so that the hardware resources of the GPGPU are used in a more balanced way.
Fig. 7 is a schematic diagram illustrating a data processing apparatus 700 according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the data processing apparatus 700 may include a data acquisition module 701, a tensor folding module 702, a task decomposition module 703, and a task execution module 704.
The data acquisition module 701 may be configured to acquire an input tensor. Alternatively, the data acquisition module 701 may perform the operations described above with reference to step S201.
For example, tensors may be obtained as input to the data processing apparatus of the present disclosure, i.e., as input to a computing task. The input tensor may refer to multi-dimensional array data that needs to be computed on the GPGPU, which may be various types of data, such as images, audio, text, or any other type of data. The dimensions, shape, and data type of the input tensor may depend on the particular application scenario and task requirements. For example, for image processing tasks, the input tensor may typically have three or four dimensions, representing the height, width, number of channels, and batch size of the image, respectively. For audio processing tasks, the input tensor may typically have two or three dimensions, representing the number of sample points, the number of channels, and the batch size of the audio, respectively. For text processing tasks, the input tensor may typically have two dimensions, representing the length of the text and the batch size, respectively.
Alternatively, the data of the input tensor may be loaded and transferred onto the GPGPU in a variety of ways. For example, common approaches may include copying data from host memory to GPGPU memory, or distributing and initializing data directly in GPGPU memory. After the calculations are performed, the results may be copied from the GPGPU memory back to host memory for further processing and analysis.
The tensor folding module 702 may be configured to divide the input tensor into a plurality of data blocks based on a minimum alignment granularity of the GPGPU and to convert the input tensor into a one-dimensional tensor, wherein each element in the one-dimensional tensor corresponds to one data block, and the minimum alignment granularity indicates that the starting address of the storage address of each of the plurality of data blocks in the memory of the GPGPU is aligned to the minimum alignment granularity. Alternatively, the tensor folding module 702 may perform the operations described above with reference to step S202.
For example, the dimensions of the input tensor may be folded according to the minimum alignment granularity defined by the hardware layout of the GPGPU to convert the input tensor into a one-dimensional tensor, where the minimum alignment granularity defined by the hardware layout of the GPGPU means that, when data is accessed in the memory of the GPGPU, the data needs to be stored with a certain byte alignment so as to improve memory access efficiency. Specifically, the multidimensional input tensor can be reorganized into a one-dimensional array and memory-aligned according to the hardware layout requirement, where the hardware layout defines how the dimensions of the input tensor are mapped to the memory of the GPGPU.
Therefore, the input tensor can be converted into a one-dimensional tensor according to the manner in which the data of the input tensor is stored in the memory of the GPGPU. In embodiments of the present disclosure, the input tensor may be divided into a plurality of data blocks according to the minimum alignment granularity defined by the hardware layout of the GPGPU, where each data block is stored with a certain byte alignment. Alignment here means that the addresses at which data is stored in memory follow a certain rule; for example, when the minimum alignment granularity is 4 bytes, the data blocks are aligned and stored on 4-byte boundaries, that is, the starting address of each data block must be a multiple of 4.
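The following sketch illustrates, on the host side, how a tensor could be divided into alignment-sized data blocks and viewed as a one-dimensional tensor of blocks. It assumes the minimum alignment granularity is expressed in bytes and that zero-padding the tail is acceptable for the computing task; this NumPy model only approximates the device-memory layout, and the function name fold_to_1d is illustrative.

```python
import numpy as np

def fold_to_1d(tensor: np.ndarray, min_align_bytes: int = 4) -> np.ndarray:
    """Sketch: split a tensor into alignment-sized data blocks and view the
    result as a one-dimensional tensor of blocks (one block per element).
    min_align_bytes stands in for the GPGPU's minimum alignment granularity."""
    block_elems = max(1, min_align_bytes // tensor.itemsize)
    flat = tensor.ravel()
    # Pad so that every data block starts on a min_align_bytes boundary.
    pad = (-flat.size) % block_elems
    if pad:
        flat = np.concatenate([flat, np.zeros(pad, dtype=flat.dtype)])
    # Each row is one data block; the row index is the 1-D tensor coordinate.
    return flat.reshape(-1, block_elems)

blocks = fold_to_1d(np.arange(10, dtype=np.int8), min_align_bytes=4)
print(blocks.shape)  # (3, 4): three 4-byte-aligned data blocks
```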
Alternatively, different data blocks divided from the input tensor may be stored in different memory areas, respectively, and the data blocks may be stored in a certain byte alignment. Alternatively, the minimum alignment granularity of the GPGPU may indicate the manner in which the individual data blocks are aligned, and may include, for example, the number of consecutively aligned data blocks and the amount of data that each data block includes.
Alternatively, all data in one data block divided from the input tensor may be stored in the same memory region to make the access of the GPGPU to the memory more continuous.
Alternatively, different data blocks divided from the input tensor may be stored in different memory areas, and the manner in which these different data blocks are stored may be based on their locations in the input tensor and a predetermined storage layout. For example, the predetermined storage layout may indicate the order in which the data blocks in the respective dimensions of the input tensor are stored in the memory of the GPGPU, e.g., the data blocks of the input tensor are stored layer by layer with the H dimension as the innermost layer, the W dimension as the next outer layer, and the N dimension as the outermost layer.
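As an illustration of such a predetermined storage layout, the sketch below linearizes a tensor so that the H dimension varies fastest, then W, then N. The (H, W, N) indexing of the input array and the function name are assumptions of the example; real layouts are defined by the hardware.

```python
import numpy as np

def store_order_nwh(tensor_hwn: np.ndarray) -> np.ndarray:
    """Sketch of the predetermined storage layout described above: H varies
    fastest (innermost), then W, then N (outermost). The input is assumed to
    be indexed as (H, W, N)."""
    # Move axes so that, after ravel(), H is the fastest-varying index.
    return np.transpose(tensor_hwn, (2, 1, 0)).ravel()

t = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # indexed as (H=2, W=3, N=4)
linear = store_order_nwh(t)
print(linear[:4])  # walks the H dimension first (for N=0, W=0), then W
```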
The task decomposition module 703 may be configured to determine a task allocation of the computing task on the hardware resources of the GPGPU based on the one-dimensional tensor according to the computing task for the input tensor. Alternatively, the task decomposition module 703 may perform the operations described above with reference to step S203.
For example, in the task decomposition module 703, the following aspects may be considered for task allocation: (1) data parallelism: whether the computing task for the input tensor can be decomposed into a plurality of independent sub-computing tasks for parallel computation; if there is no dependency between the sub-computing tasks and different data blocks can be processed in parallel, parallelism can be improved through data parallelism; (2) task allocation: the computing tasks are reasonably distributed to different execution units, and the allocation strategy may be determined according to factors such as task complexity, data dependency relationships, and computing load; for example, compute-intensive tasks may be assigned to CUs, and data-intensive tasks may be assigned to EUs; (3) data transfer: in distributed computing, data may need to be transferred from host memory to GPGPU memory, or between GPGPU memories; such transfers should be managed reasonably so as to reduce data transfer overhead and improve the execution efficiency of the computing task.
Alternatively, for each element in the one-dimensional tensor described above, i.e., each data block, a corresponding SPC in the hardware resources of the GPGPU may be used to execute the sub-computing task for that data block. Alternatively, when a first number of SPCs are available to execute the computing task for the input tensor, each SPC may process the sub-computing tasks for a plurality of data blocks at a time, and the plurality of SPCs may concurrently process the sub-computing tasks for successive groups of the first number of consecutive data blocks of the one-dimensional tensor, in the storage order of the data blocks indicated by the predetermined storage layout described above.
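A simplified sketch of such an allocation is given below: in each concurrent round, a run of num_spcs * blocks_per_spc consecutive data blocks of the one-dimensional tensor is handed to the available SPCs, blocks_per_spc consecutive blocks per SPC. The function name, the dictionary representation, and the grouping policy are illustrative only and are not mandated by the claims.

```python
def assign_blocks_to_spcs(num_blocks: int, num_spcs: int, blocks_per_spc: int = 1):
    """Sketch: distribute consecutive data-block indices of the 1-D tensor to
    a first number of SPCs, round by round."""
    assignment = {spc: [] for spc in range(num_spcs)}
    stride = num_spcs * blocks_per_spc
    for start in range(0, num_blocks, stride):          # one concurrent round
        for spc in range(num_spcs):
            lo = start + spc * blocks_per_spc
            hi = min(lo + blocks_per_spc, num_blocks)
            if lo < num_blocks:
                assignment[spc].extend(range(lo, hi))   # consecutive block indices
    return assignment

print(assign_blocks_to_spcs(num_blocks=10, num_spcs=4, blocks_per_spc=1))
# {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
```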
Alternatively, in the data processing method of the present disclosure, various ways of converting the input tensor into the one-dimensional tensor may be adopted based on the actual data processing capability of the hardware resources of the GPGPU, together with a task allocation method corresponding to the chosen way.
Optionally, when several dimensions of the input tensor are folded into one dimension, the dimension along which the memory address changes least with the coordinate (i.e., the dimension with the smallest address jitter) may be placed at the innermost layer, so as to improve memory access efficiency. A smaller jitter of memory addresses corresponds to a smaller difference between the addresses of adjacent data elements in memory, i.e., the storage locations of the data are more contiguous. Conversely, a larger jitter corresponds to a larger difference between the addresses of adjacent data elements, i.e., more scattered storage locations. By placing the dimension with the smallest address jitter at the innermost layer of the tensor folding and task allocation (e.g., in the case shown in fig. 3, the H dimension of the input tensor is placed at the innermost layer), accesses to the data in memory can be made more contiguous. Therefore, when an execution unit accesses data, better data continuity can be achieved, thereby reducing the latency of memory access operations and improving the utilization of memory access bandwidth.
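On the host side, the dimension with the smallest memory-address jitter can be identified from the byte strides of the tensor, as in the following sketch; this NumPy view is only an approximation of the actual hardware layout, and the function name is illustrative.

```python
import numpy as np

def innermost_axis_by_stride(tensor: np.ndarray) -> int:
    """Sketch: the axis whose byte stride is smallest is the one whose memory
    address changes least ("jitters" least) as its coordinate changes, so it
    is placed at the innermost layer of the folding."""
    return int(np.argmin(np.abs(tensor.strides)))

t = np.zeros((8, 32, 128), dtype=np.float32)  # C-contiguous: last axis is densest
print(innermost_axis_by_stride(t))            # 2 -> fold with this axis innermost
```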
For example, when each thread of the GPGPU reads and processes only one data element of the input tensor at a time, tensor folding may be performed in units of one data element along the innermost dimension, i.e., the dimension whose memory address changes least with the coordinate.
In addition, the burst transfer characteristics of the GPGPU may be considered, i.e., the GPGPU's capability to implement efficient data transfer and computation by exploiting hardware optimizations such as data parallelism, instruction-level parallelism, caching, and memory bandwidth. In an embodiment of the present disclosure, the case where each thread of the GPGPU reads and processes two data elements simultaneously may therefore also be considered. For example, when each thread of the GPGPU can read and process two data elements of the input tensor at a time, tensor folding may be performed in units of two data elements along the innermost dimension.
Further, the memory access efficiency of the method of the present disclosure may be improved further on the basis of the second example described above. For example, when each thread of the GPGPU can read and process two data elements of the input tensor at a time, tensor folding may be performed in units of four data elements along the innermost dimension.
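The effect of the folding unit on the number of one-dimensional elements can be sketched as follows; padding of the tail is an assumption of the example, and the function name is illustrative.

```python
import math

def fold_unit_count(innermost_len: int, data_per_thread: int) -> int:
    """Sketch: number of folding units along the innermost dimension when each
    thread reads and processes `data_per_thread` elements at a time (1, 2 or 4
    in the examples above). Tail handling is assumed to be simple padding."""
    return math.ceil(innermost_len / data_per_thread)

for k in (1, 2, 4):
    print(k, fold_unit_count(innermost_len=130, data_per_thread=k))
# prints: 1 130, 2 65, 4 33 -> fewer, wider accesses as each thread handles more data
```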
The task execution module 704 may be configured to execute a computing task for the input tensor based on a task allocation of the computing task on hardware resources of the GPGPU. Alternatively, the task execution module 704 may perform the operations described above with reference to step S204.
After task allocation of the computing task for the input tensor is completed, optionally, for computing tasks that are insensitive to memory access order, the computing task for the input tensor may be decomposed into a plurality of independent sub-computing tasks along one dimension, and those sub-computing tasks may be reasonably allocated to different execution units in that dimension. Because the dimensions of the input tensor are folded into one dimension according to the minimum alignment granularity of the hardware, parallelism can be increased through data parallelism, memory access efficiency is improved by exploiting the hardware alignment requirement, memory access latency is reduced, the parallel computing capability of the GPGPU is better utilized, and operator execution is accelerated. In addition, performing task allocation in one dimension yields a more uniform distribution of work, which improves memory bandwidth utilization and balances the use of the GPGPU's hardware resources.
According to yet another aspect of the present disclosure, there is also provided a data processing device. Fig. 8 shows a schematic diagram of a data processing device 2000 according to an embodiment of the present disclosure.
As shown in fig. 8, the data processing device 2000 may include one or more processors 2010 and one or more memories 2020. The memory 2020 stores computer-readable code which, when executed by the one or more processors 2010, can perform the data processing method described above.
The processor in embodiments of the present disclosure may be an integrated circuit chip having signal processing capabilities. The processor may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present disclosure. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 architecture or the ARM architecture.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
For example, a method or apparatus according to embodiments of the present disclosure may also be implemented by means of the architecture of the computing device 3000 shown in fig. 9. As shown in fig. 9, the computing device 3000 may include a bus 3010, one or more CPUs 3020, a read only memory (ROM) 3030, a random access memory (RAM) 3040, a communication port 3050 connected to a network, an input/output component 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used in the processing and/or communication of the data processing method provided by the present disclosure, as well as program instructions executed by the CPU. The computing device 3000 may also include a user interface 3080. Of course, the architecture shown in fig. 9 is merely exemplary, and one or more components of the computing device shown in fig. 9 may be omitted according to practical needs when implementing different devices.
According to yet another aspect of the present disclosure, a computer-readable storage medium is also provided. The computer-readable storage medium has computer-readable instructions stored thereon. When the computer-readable instructions are executed by a processor, the data processing method according to the embodiments of the present disclosure described with reference to the above figures may be performed. The computer-readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs a data processing method according to an embodiment of the present disclosure.
Embodiments of the present disclosure provide a data processing method, apparatus, device, computer program product, and computer readable storage medium.
The method provided by the embodiment of the disclosure divides an input tensor into a plurality of data blocks based on the minimum alignment granularity of the GPGPU, and converts the input tensor into a one-dimensional tensor based on the divided plurality of data blocks so as to decompose a calculation task of the input tensor in the dimension of the one-dimensional tensor, thereby decomposing the calculation task of the input tensor into a plurality of independent subtasks which can be executed on hardware resources of the GPU in parallel. According to the method, task decomposition can be performed on the dimension of the one-dimensional tensor, so that the computation parallelism of the GPGPU is improved, the higher hardware resource utilization rate is realized, and the overall computation performance and efficiency are improved.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The exemplary embodiments of the present disclosure described in detail above are illustrative only and are not limiting. Those skilled in the art will understand that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and such modifications should fall within the scope of the disclosure.

Claims (10)

1. A data processing method, comprising:
acquiring an input tensor;
dividing the input tensor into a plurality of data blocks based on a minimum alignment granularity of a general purpose graphics processing unit (GPGPU), and converting the input tensor into a one-dimensional tensor, wherein each element in the one-dimensional tensor corresponds to one data block, and the minimum alignment granularity is used for indicating that the starting address at which each data block in the plurality of data blocks is stored in a memory of the GPGPU is aligned based on the minimum alignment granularity;
determining, according to a computing task for the input tensor, task allocation of the computing task on hardware resources of the GPGPU based on the one-dimensional tensor; and
executing the computing task for the input tensor based on the task allocation of the computing task on the hardware resources of the GPGPU.
2. The method of claim 1, wherein, for a plurality of memory regions in memory of the GPGPU for storing the input tensor, data in a data block corresponding to one element of the one-dimensional tensor is stored in one of the plurality of memory regions, and
for different elements in the one-dimensional tensor, different data blocks corresponding to the different elements are stored in one or more of the plurality of memory regions based on their locations in the input tensor and a predetermined storage layout indicating a manner in which data blocks in different dimensions of the input tensor are arranged in memory of the GPGPU.
3. The method of claim 2, wherein converting the input tensor into a one-dimensional tensor comprises:
the plurality of data blocks are arranged in the predetermined storage layout based on the position of each of the plurality of data blocks in the input tensor to convert the input tensor into the one-dimensional tensor.
4. The method of claim 3, wherein determining a task allocation of the computing task on hardware resources of the GPGPU based on the one-dimensional tensor according to the computing task for the input tensor comprises:
decomposing the computing task for the input tensor into sub-computing tasks performed on each of a plurality of execution units of the GPGPU based on a position in the input tensor of a data block corresponding to each element in the one-dimensional tensor and the predetermined storage layout, wherein the sub-computing tasks performed on each execution unit are used to process one element in the one-dimensional tensor;
wherein each execution unit comprises a plurality of threads, and each thread is used for processing one data element in the data block corresponding to one element in the one-dimensional tensor.
5. The method of claim 3, wherein determining a task allocation of the computing task on hardware resources of the GPGPU based on the one-dimensional tensor according to the computing task for the input tensor comprises:
decomposing the computing task for the input tensor into sub-computing tasks performed on each of a plurality of execution units of the GPGPU based on a position in the input tensor of a data block corresponding to each element in the one-dimensional tensor and the predetermined storage layout, wherein the sub-computing tasks performed on each execution unit are for processing one element in the one-dimensional tensor, the one element corresponding to two or more data blocks of the input tensor that are consecutive based on the predetermined storage layout;
wherein each execution unit comprises a plurality of threads, and each thread is used for processing two data elements, the two data elements coming respectively from two adjacent data blocks among the two or more data blocks corresponding to one element in the one-dimensional tensor.
6. The method of claim 4 or 5, wherein performing the computing task for the input tensor based on a task allocation of the computing task on hardware resources of the GPGPU comprises:
the computing tasks for the input tensor are performed by executing respective sub-computing tasks on a plurality of execution units of the GPGPU.
7. A data processing apparatus comprising:
a data acquisition module configured to acquire an input tensor;
a tensor folding module configured to divide the input tensor into a plurality of data blocks based on a minimum alignment granularity of a general purpose graphics processing unit (GPGPU), and to convert the input tensor into a one-dimensional tensor, wherein each element in the one-dimensional tensor corresponds to one data block, the minimum alignment granularity being used to indicate that the starting address at which each of the plurality of data blocks is stored in a memory of the GPGPU is aligned based on the minimum alignment granularity;
a task decomposition module configured to determine, according to a computing task for the input tensor, a task allocation of the computing task on hardware resources of the GPGPU based on the one-dimensional tensor; and
a task execution module configured to execute the computing task for the input tensor based on the task allocation of the computing task on the hardware resources of the GPGPU.
8. The apparatus of claim 7, wherein, for a plurality of memory regions in memory of the GPGPU for storing the input tensor, data in a data block corresponding to one element of the one-dimensional tensor is stored in one of the plurality of memory regions, and
for different elements in the one-dimensional tensor, different data blocks corresponding to the different elements are stored in one or more of the plurality of memory regions based on their locations in the input tensor and a predetermined storage layout indicating a manner in which data blocks in different dimensions of the input tensor are arranged in memory of the GPGPU.
9. A data processing device, comprising:
one or more processors; and
one or more memories in which a computer executable program is stored which, when executed by the processor, performs the method of any of claims 1-6.
10. A computer readable storage medium having stored thereon computer executable instructions for implementing the method of any of claims 1-6 when executed by a processor.
CN202311373162.5A 2023-10-20 2023-10-20 Data processing method, device, equipment and storage medium Pending CN117271136A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311373162.5A CN117271136A (en) 2023-10-20 2023-10-20 Data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311373162.5A CN117271136A (en) 2023-10-20 2023-10-20 Data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117271136A true CN117271136A (en) 2023-12-22

Family

ID=89221487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311373162.5A Pending CN117271136A (en) 2023-10-20 2023-10-20 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117271136A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117634711A (en) * 2024-01-25 2024-03-01 北京壁仞科技开发有限公司 Tensor dimension segmentation method, system, device and medium
CN117634711B (en) * 2024-01-25 2024-05-14 北京壁仞科技开发有限公司 Tensor dimension segmentation method, system, device and medium

Similar Documents

Publication Publication Date Title
TWI525540B (en) Mapping processing logic having data-parallel threads across processors
JP5461533B2 (en) Local and global data sharing
CN110008009B (en) Binding constants at runtime to improve resource utilization
TWI656481B (en) Method,computer-readable medium and system associated with merge-based parallelized consumption of sequences
US8996846B2 (en) System, method and computer program product for performing a scan operation
US8707320B2 (en) Dynamic partitioning of data by occasionally doubling data chunk size for data-parallel applications
CN111079917B (en) Tensor data block access method and device
US8872839B2 (en) Real-time atlasing of graphics data
Catanzaro et al. A decomposition for in-place matrix transposition
KR102594657B1 (en) Method and apparatus for implementing out-of-order resource allocation
Martín et al. Algorithmic strategies for optimizing the parallel reduction primitive in CUDA
JP7201802B2 (en) Data read/write method and system in 3D image processing, storage medium and terminal
CN117271136A (en) Data processing method, device, equipment and storage medium
CN112162854A (en) Method, system and medium for scheduling calculation tasks between CPU-GPU
US20130176323A1 (en) Method and apparatus for graphic processing using multi-threading
US20190278574A1 (en) Techniques for transforming serial program code into kernels for execution on a parallel processor
Al Sideiri et al. CUDA implementation of fractal image compression
CN116245997A (en) Three-dimensional model dynamic rendering parallel acceleration method and system based on supercomputer
Beaumont et al. Optimal GPU-CPU offloading strategies for deep neural network training
CN116382880A (en) Task execution method, device, processor, electronic equipment and storage medium
US10593103B2 (en) Method and apparatus for managing graphics layers within a data processing system
US20230090284A1 (en) Memory processing optimisation
US20210255793A1 (en) System and method for managing conversion of low-locality data into high-locality data
US9772864B2 (en) Methods of and apparatus for multidimensional indexing in microprocessor systems
US20220188380A1 (en) Data processing method and apparatus applied to graphics processing unit, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination