CN117634711A - Tensor dimension segmentation method, system, device and medium - Google Patents

Tensor dimension segmentation method, system, device and medium

Info

Publication number
CN117634711A
Authority
CN
China
Prior art keywords
tensor
dimensions
dimension
computing units
parallel computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410100952.4A
Other languages
Chinese (zh)
Other versions
CN117634711B (en)
Inventor
Name withheld upon request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410100952.4A priority Critical patent/CN117634711B/en
Publication of CN117634711A publication Critical patent/CN117634711A/en
Application granted granted Critical
Publication of CN117634711B publication Critical patent/CN117634711B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a tensor dimension splitting method, a tensor dimension splitting system, an electronic device, and a non-transitory storage medium. The tensor dimension splitting method includes: splitting one or more selected dimensions according to the splitting granularity of each dimension of a tensor to obtain the number of tensor blocks after the selected one or more dimensions are split; and determining, according to the number of tensor blocks and the number of a plurality of available computing units, whether splitting the selected one or more dimensions is the best splitting strategy.

Description

Tensor dimension segmentation method, system, device and medium
Technical Field
The present application relates to the field of computer information processing, and more particularly to a tensor dimension splitting method, a tensor dimension splitting system, an electronic device, and a non-transitory storage medium.
Background
In recent years, processing deep learning neural networks on computers has involved convolution or matrix multiplication operations on tensors of two or more dimensions. In practice, a dimension of the tensor, such as its height or width, may exceed the limits of the hardware accelerator, or the tensor may need to be computed in parallel by multiple computing units; in either case, a certain dimension of the tensor needs to be segmented (also called split).
Disclosure of Invention
According to one aspect of the present application, there is provided a tensor dimension splitting method, including: splitting one or more selected dimensions according to the splitting granularity of each dimension of a tensor to obtain the number of tensor blocks after the selected one or more dimensions are split; and determining, according to the number of tensor blocks and the number of a plurality of available parallel computing units, whether splitting the selected one or more dimensions is the best splitting strategy.
According to another aspect of the present application, there is provided a tensor dimension splitting system, including: a splitting device configured to split the selected one or more dimensions according to the splitting granularity of each dimension of the tensor to obtain the number of tensor blocks after the selected one or more dimensions are split; and a determining device configured to determine, according to the number of tensor blocks and the number of the plurality of available parallel computing units, whether splitting the selected one or more dimensions is the best splitting strategy.
According to another aspect of the present application, there is provided an electronic device including: a memory for storing instructions; and a processor for reading the instructions in the memory and performing a method according to an embodiment of the present application.
According to another aspect of the present application, there is provided a non-transitory storage medium having instructions stored thereon, wherein the instructions, when read by a processor, cause the processor to perform a method according to an embodiment of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present disclosure, and that a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
Fig. 1 shows a usage scenario diagram according to an embodiment of the present application.
Fig. 2 shows a flow chart of a tensor dimension splitting method according to an embodiment of the present application.
Fig. 3 shows a flow chart for selecting one or more dimensions in ascending order of the number of dimensions according to the present application.
Fig. 4A and 4B each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is convolution forward propagation (Convolution Forward, Conv FWD).
Fig. 5A and 5B each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is convolution backward propagation of the activation (Convolution Backward Propagation Activation, Conv BPA).
Fig. 6A, 6B, and 6C each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is convolution backward propagation of the weight (Convolution Backward Propagation Weight, Conv BPW).
Fig. 7A and 7B each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is matrix multiply-add (Matrix multiply and accumulate, MMA) without transpose.
Fig. 8A and 8B each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is MMA with transpose.
Fig. 9A and 9B each show an example of matching the splitting strategy of the input tensor to the output tensor in the case where the operator operation is a fusion operator of MMA plus bias (BIAS).
Fig. 10 shows a block diagram of a tensor dimension splitting system according to an embodiment of the present application.
FIG. 11 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application.
FIG. 12 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the specific embodiments of the present application, examples of which are illustrated in the accompanying drawings. While the present application will be described in conjunction with the specific embodiments, it will be understood that it is not intended to limit the present application to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the application as defined by the appended claims. It should be noted that the method steps described herein may be implemented by any functional block or arrangement of functions, and any functional block or arrangement of functions may be implemented as a physical entity or a logical entity, or a combination of both.
In the related art, when a dimension of a tensor is split, the split dimension is usually restricted, mainly to the batch dimension, and the resulting tensor blocks are distributed to different computing units to be processed in parallel. When the batch dimension is small and the number of computing units is large, many computing units stay idle, so the utilization of the hardware computing units is very low. In addition, the underlying operators in the related art do not support flexible splitting: only a single dimension can be split, multiple dimensions cannot be split at the same time, and the splitting strategy is simple and cannot account for different usage scenarios. Furthermore, when the split tensor blocks are processed in parallel by multiple computing units, each computing unit only holds a part of the tensors of the neural network model parameters, so additional communication is required in both forward and backward propagation, and the computing efficiency of model parallelism is low.
The present disclosure proposes a tensor dimension splitting method that supports splitting any dimension of a tensor for parallel processing by multiple computing units. It can automatically infer which dimension or dimensions to split for different tensor shapes and layouts (including dimensions and sizes) and for different numbers of computing units, so that the tensor is split along those dimensions and the resulting blocks are delivered to the multiple computing units for parallel processing; the whole strategy can be handled automatically by an automated handler manager without the user being aware of it. Moreover, by taking the operator operation scenario into account, it can be ensured that all tensors involved in the operator operation are split according to the same rule, and the shape passed to the next-layer operator is the shape of the tensor after splitting, so the next-layer operator requires less adaptation work.
Fig. 1 shows a usage scenario diagram according to an embodiment of the present application.
As shown in fig. 1, the tensor dimension splitting method according to an embodiment of the present application may be applied in an automated handler manager (pass manager). In one usage scenario of an embodiment of the present application, the automated handler manager 110 may receive a user-defined tensor splitting policy entered by a user; for example, the user may specify in which dimension or dimensions the tensor is to be split. The automated handler manager 110 may check whether the custom splitting policy is the best splitting policy. Alternatively, if it is configured that a received user-defined tensor splitting policy is applied directly, the automated handler manager 110 no longer determines the best splitting policy and simply outputs the user-defined policy so that it is applied.
If the user does not input a custom tensor splitting policy, the automated handler manager 110 may automatically determine the best splitting policy based on the shape and/or layout of the input tensor and the number of available computing units, i.e., determine in which dimension or dimensions splitting the tensor can make full use of all available computing units 120 to perform the operator operation in parallel.
The best splitting policy determined by the automated handler manager 110 may be applied uniformly to a corresponding matrix computation core (Tcore) operator, a vector computation core (Vcore) operator, or a fusion operator of both.
The best splitting strategy determined by the automated handler manager 110 can adapt to different hardware types of the computing units 120, which improves the efficiency of adapting operators to different numbers of computing units under the same processor architecture, makes maximal use of the computing units 120, and improves computing utilization.
The plurality of computing units 120 may perform operator operations in parallel and may therefore be referred to as parallel computing units. A computing unit 120 may be a stream processor cluster (Stream Processor Cluster, SPC) or another computing unit capable of performing operator operations in parallel.
Fig. 2 shows a flow chart of a tensor dimension splitting method according to an embodiment of the present application.
As shown in fig. 2, the tensor dimension splitting method 200 includes: step 210, splitting the selected one or more dimensions according to the splitting granularity of each dimension of the tensor to obtain the number of tensor blocks after the selected one or more dimensions are split; step 220, determining whether splitting the selected one or more dimensions is the best splitting strategy according to the number of tensor blocks and the number of the plurality of available parallel computing units.
In this way, based on the number of tensor blocks obtained after the dimensions of the tensor are split and the number of available computing units, it can be determined more accurately whether splitting the selected dimension or dimensions is the best splitting strategy, so that the available computing units are used more fully to perform the operator operations of the tensor blocks in parallel.
Note that an "available parallel computing unit" herein simply refers to an available computing unit (e.g., a computing core); because available computing units can usually execute threads and the like independently and in parallel, they are referred to herein as "available parallel computing units", which does not require that a computing unit be named "parallel" or be usable only as an "available parallel computing unit".
In step 210, the selected one or more dimensions are split according to the splitting granularity of each dimension of the tensor, so as to obtain the number of tensor blocks after the selected one or more dimensions are split.
Here, the splitting granularity refers to the splitting step used when splitting a dimension, that is, the size of each slice of that dimension after splitting. For tensors of different layouts, a splitting granularity is set for each dimension of the tensor.
Table 1 shows the splitting granularity of each dimension of a 4-dimensional tensor (for example, an activation tensor with the four dimensions Batch (N), Height (H), Width (W), and Channel (C)).
TABLE 1
Table 2 shows the splitting granularity of each dimension of a 3-dimensional matrix tensor, for example a matrix tensor with the three dimensions Batch (N), Height (H), and Width (W).
TABLE 2
Dimension      Splitting granularity
Batch (N)      1
Height (H)     64
Width (W)      64
The tensor layouts and the splitting granularity of each dimension described above are examples and are not limiting; in practice, other splitting granularities for each dimension of other tensor layouts may also be used.
How one or more dimensions are selected for segmentation will also be described in detail later herein.
In one embodiment, step 210 of splitting the selected one or more dimensions according to the splitting granularity of each dimension of the tensor to obtain the number of tensor blocks after the selected one or more dimensions are split includes: dividing the size of each of the selected one or more dimensions by the splitting granularity of that dimension to obtain the slice count of that dimension; and multiplying the slice counts of all of the selected one or more dimensions to obtain the number of tensor blocks after the one or more dimensions are split.
For example, assume a three-dimensional tensor [4,4096,8192] and 8192 available parallel computing units, and that the height and width dimensions of the three-dimensional tensor are selected. Dividing the size of each of the two selected dimensions by the splitting granularity of that dimension gives the slice count of that dimension: the slice count of the height dimension is 4096/64 = 64, and the slice count of the width dimension is 8192/64 = 128. Splitting the tensor along both the height and width dimensions then yields 64 × 128 = 8192 tensor blocks, each of size [4, 64, 64]. Now assume instead that the number of parallel computing units is 16 and that only the height dimension is selected. Dividing the size of the height dimension by its splitting granularity gives a slice count of 4096/64 = 64. Splitting the tensor along the height dimension then yields 64 tensor blocks, each of size [4, 64, 8192].
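As an illustration, the block-count computation described above can be sketched as follows (a minimal sketch; the dictionary of dimension names and the granularity values are assumptions taken from the example, not the patent's actual data structures):

```python
from math import prod

def num_tensor_blocks(shape, granularity, selected_dims):
    """Multiply the slice counts of the selected dimensions to get the block count."""
    slice_counts = [shape[d] // granularity[d] for d in selected_dims]
    return prod(slice_counts)

shape = {"N": 4, "H": 4096, "W": 8192}      # the tensor [4, 4096, 8192]
granularity = {"N": 1, "H": 64, "W": 64}    # splitting granularity per dimension
print(num_tensor_blocks(shape, granularity, ["H", "W"]))  # 64 * 128 = 8192 blocks
print(num_tensor_blocks(shape, granularity, ["H"]))       # 64 blocks
```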
In step 220, it is determined whether splitting the selected one or more dimensions is the best splitting strategy based on the number of tensor blocks and the number of the plurality of available parallel computing units.
In one embodiment, step 220 may include: determining, based on the number of tensor blocks, that splitting the selected one or more dimensions is the best splitting strategy if that number of tensor blocks maximizes the utilization of the available parallel computing units. Step 220 may further include: performing, with the available parallel computing units, the operator operations of the plurality of tensor blocks obtained after the tensor is split according to the best splitting strategy.
Here, a best splitting strategy whose number of tensor blocks maximizes the utilization of the available parallel computing units is one that: enables each available parallel computing unit to process the same number of tensor blocks at the same time, so that no available parallel computing unit is idle at any moment; or, as far as possible, enables each available parallel computing unit to process substantially the same number of tensor blocks at the same time, minimizing the number of idle available parallel computing units.
In one embodiment, determining that splitting the selected one or more dimensions is the best splitting strategy if the number of tensor blocks maximizes the utilization of the available parallel computing units includes: in response to the number of tensor blocks being divisible by the number of available parallel computing units, determining that splitting the selected one or more dimensions is the best splitting strategy; in response to the number of tensor blocks not being divisible by the number of available parallel computing units, selecting one or more other dimensions and determining whether the number of tensor blocks after the selected other dimension or dimensions are split is divisible by the number of available parallel computing units; in response to the number of tensor blocks after the selected other dimension or dimensions are split being divisible by the number of available parallel computing units, determining that splitting the selected other dimension or dimensions is the best splitting strategy; and in response to no selection of one or more of the dimensions making the number of tensor blocks divisible by the number of available parallel computing units, determining that splitting the one or more dimensions that maximize the remainder of the number of tensor blocks divided by the number of available parallel computing units, or that maximize the number of parallel units actually needed for the tensor blocks, is the best splitting strategy.
For example, in the divisible case, as in the previous example, the tensor [4,4096,8192] is split along the height and width dimensions of the three-dimensional tensor to obtain 64 × 128 = 8192 tensor blocks. If the number of available parallel computing units is 32, then 8192 divided by 32 is exact with a quotient of 256, and it is determined that splitting along the height and width dimensions of the three-dimensional tensor is the best splitting strategy.
Performing, with the available parallel computing units, the operator operations of the plurality of tensor blocks obtained after the tensor is split according to the best splitting strategy then includes: having each available parallel unit perform, in parallel, the operator operations of a number of tensor blocks equal to the quotient of the number of tensor blocks divided by the number of available parallel computing units. That is, each of the 32 available parallel units performs the operator operations of 256 tensor blocks. In this way, each available parallel computing unit processes the same number of tensor blocks at the same time, and no available parallel computing unit is idle at any moment.
For example, in the non-divisible case, as in the previous example, the tensor [4,4096,8192] is split along the height and width dimensions of the three-dimensional tensor into 64 × 128 = 8192 tensor blocks. If the number of available parallel computing units is 36, then 8192 is not divisible by 36, so one or more other dimensions are selected and it is determined whether the number of tensor blocks after the selected other dimension or dimensions are split is divisible by the number of available parallel computing units; if so, splitting the selected other dimension or dimensions is determined to be the best splitting strategy.
And so on, until no selection of one or more of the dimensions makes the number of tensor blocks divisible by the number of available parallel computing units; then the one or more dimensions with the largest remainder of the number of tensor blocks divided by the number of available parallel computing units, or with the largest number of parallel units actually needed, are determined to give the best splitting strategy.
In this case, each available parallel computing unit first performs in parallel the operator operations of a number of tensor blocks equal to the quotient of the number of tensor blocks divided by the number of available parallel computing units, and then the operator operations of the remaining tensor blocks are performed in parallel by a number of available parallel units equal to the remainder of the number of tensor blocks divided by the number of available parallel computing units.
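The divisibility check and the resulting distribution of blocks over the units can be sketched as follows (an illustrative sketch only; the function names are assumptions, not the patent's implementation):

```python
def is_best_by_divisibility(num_blocks: int, num_units: int) -> bool:
    """A split is accepted outright when the block count divides evenly over the units."""
    return num_blocks % num_units == 0

def distribute_blocks(num_blocks: int, num_units: int):
    """Each unit gets `quotient` blocks; `remainder` blocks are then handled by that many units."""
    quotient, remainder = divmod(num_blocks, num_units)
    return quotient, remainder

print(is_best_by_divisibility(8192, 32))  # True: each of the 32 units processes 256 blocks
print(distribute_blocks(64, 36))          # (1, 28): all 36 units run once, then 28 of them run again
```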
For example, the tensor [4,4096,8192] is split along the batch, height, and width dimensions of the three-dimensional tensor (with splitting granularities 1, 64, and 64 respectively, as in Table 2). Splitting along the batch dimension alone gives 4 tensor blocks, splitting along the height dimension alone gives 64 tensor blocks, and splitting along the width dimension alone gives 128 tensor blocks; splitting along the batch and height dimensions gives 4 × 64 = 256 tensor blocks, splitting along the batch and width dimensions gives 4 × 128 = 512 tensor blocks, splitting along the height and width dimensions gives 64 × 128 = 8192 tensor blocks, and splitting along all three dimensions gives 32768 tensor blocks. Assume the number of available parallel computing units is 36.
If the criterion is that splitting the one or more dimensions with the largest remainder of the number of tensor blocks divided by the number of available parallel computing units is the best splitting strategy, then for the 64 tensor blocks obtained by splitting along the height dimension alone, dividing by the 36 available parallel computing units gives a quotient of 1 and the largest remainder, 28, so splitting along the height dimension alone is determined to be the best splitting strategy.
In this case, each of the 36 available parallel computing units performs the operator operations of 1 tensor block, that is, the 36 available parallel computing units perform the operator operations of 36 tensor blocks in parallel, and then 28 of the 36 available parallel units perform the operator operations of the remaining 28 tensor blocks in parallel.
In the case where the criterion is that splitting the one or more dimensions that maximize the number of parallel units needed for the tensor blocks is the best splitting strategy, the number of parallel computing units needed can be calculated from the number of blocks. The specific formula is: number of parallel computing units needed = ceil(number of tensor blocks / ceil(number of tensor blocks / number of available parallel computing units)), where ceil denotes rounding up; that is, the blocks are spread over the minimum number of rounds, and the result is the number of units kept busy per round. Splitting the one or more dimensions whose number of tensor blocks makes this needed number of parallel computing units largest is determined to be the best splitting strategy, and the operator operations of the plurality of tensor blocks obtained by splitting the tensor according to the best splitting strategy are performed in parallel using that largest number of parallel computing units.
Substituting the 4 tensor blocks obtained by splitting along the batch dimension alone into the formula gives 4 needed parallel computing units. Substituting the 64 tensor blocks obtained by splitting along the height dimension alone gives 32 needed parallel computing units, and substituting the 128 tensor blocks obtained by splitting along the width dimension alone also gives 32. It may therefore be determined that splitting along either the height dimension or the width dimension is the best splitting strategy.
In this way, each available parallel computing unit processes, as far as possible, the same number of tensor blocks at the same time, so that no available parallel computing unit is idle at a given moment; or each available parallel computing unit processes substantially the same number of tensor blocks at the same time, minimizing the number of idle available parallel computing units.
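The needed-units criterion can be sketched as follows (a minimal sketch, assuming the nested-ceiling reading of the formula above; the candidate table simply reuses the numbers of the worked example and is not the patent's data structure):

```python
from math import ceil

def needed_units(num_blocks: int, num_available_units: int) -> int:
    """Units kept busy per round when the blocks are spread over the minimum number of rounds."""
    rounds = ceil(num_blocks / num_available_units)
    return ceil(num_blocks / rounds)

candidates = {"N": 4, "H": 64, "W": 128}   # block counts of the single-dimension splits
print({dim: needed_units(n, 36) for dim, n in candidates.items()})  # {'N': 4, 'H': 32, 'W': 32}
best = max(candidates, key=lambda dim: needed_units(candidates[dim], 36))
print(best)  # 'H' (ties with 'W' at 32 needed units)
```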
Next, it is described how one or more dimensions are selected for the determination of the best splitting strategy in embodiments of the present application.
In one embodiment, the selected one or more dimensions may be selected by the user. This corresponds to receiving the user-defined multi-computing-unit splitting policy input by the user, as shown in Fig. 1. In this case, it may be determined whether the user-defined multi-computing-unit splitting policy is the best splitting policy.
In one embodiment, one or more of the dimensions may be selected at random each time as the selected one or more dimensions, and the best splitting strategy is then determined automatically.
In one embodiment, to split as few dimensions as possible, each single dimension may be selected first, then the various combinations of two dimensions, and then the various combinations of three dimensions, in ascending order of the number of dimensions, until all dimensions are selected. In this way the best splitting strategy can be found while splitting as few dimensions as possible; splitting fewer dimensions reduces the complexity of the splitting and of the subsequent operator computations.
Fig. 3 shows a flow chart for selecting one or more dimensions in ascending order of the number of dimensions according to the present application.
As shown in fig. 3, in step 310, the shape of the tensor, for example a four-dimensional tensor (N, C, H, W), is input. In step 320, the number of slices obtained when each dimension is split according to its splitting granularity is calculated. Then, one or more dimensions are selected in ascending order of the number of dimensions, starting from one dimension and going up to four dimensions.
Specifically, in step 330, each single dimension is selected in turn, and it is determined whether the number of tensor blocks obtained after splitting that dimension is divisible by the number of available parallel computing units: for example, it is determined whether the number of tensor blocks after splitting the N dimension is divisible by the number of available parallel computing units, then the same is determined for the C dimension, then for the H dimension, and then for the W dimension. If the number of tensor blocks obtained after splitting any single dimension is divisible by the number of available parallel computing units, step 380 is entered and the current splitting strategy is determined to be the best splitting strategy.
If splitting no single dimension makes the resulting number of tensor blocks divisible by the number of available parallel computing units, step 340 is entered: the various combinations of two dimensions are selected for splitting, the resulting numbers of tensor blocks are computed, and it is determined whether any of them is divisible by the number of available parallel computing units. If some combination of two dimensions makes the number of tensor blocks obtained by splitting those two dimensions divisible by the number of available parallel computing units, step 380 is entered and the current splitting strategy is determined to be the best splitting strategy.
If no combination of two dimensions makes the resulting number of tensor blocks divisible by the number of available parallel computing units, step 350 is entered: the various combinations of three dimensions are selected for splitting, and it is determined whether any resulting number of tensor blocks is divisible by the number of available parallel computing units. If some combination of three dimensions makes the number of tensor blocks divisible by the number of available parallel computing units, step 380 is entered and the current splitting strategy is determined to be the best splitting strategy.
If no combination of three dimensions makes the resulting number of tensor blocks divisible by the number of available parallel computing units, step 360 is entered: all four dimensions are selected for splitting, and it is determined whether the resulting number of tensor blocks is divisible by the number of available parallel computing units. If it is, step 380 is entered and the current splitting strategy is determined to be the best splitting strategy.
If splitting all four dimensions still does not make the number of tensor blocks divisible by the number of available parallel computing units, step 370 is entered, where the combination of one or more dimensions that maximizes the remainder of the number of tensor blocks divided by the number of available parallel computing units is determined to give the best splitting strategy.
The above illustrates the order of selecting dimensions and determining the best splitting strategy for a tensor with four dimensions; those skilled in the art can easily derive the corresponding order for a tensor with three dimensions.
Considering that different dimensions yield different slice counts when split according to their splitting granularity, the larger the slice count, the more fully the parallel computing units can be utilized. Therefore, if the dimensions with larger slice counts are selected for splitting first, and it is then determined whether the resulting number of tensor blocks is divisible by the number of available parallel computing units, the best splitting strategy can be found faster, so that the parallel computing units are utilized more fully.
Therefore, in one embodiment, to find the best splitting strategy faster, the dimensions are examined in ascending order of the number of dimensions and, within each size, in descending order of their slice counts: each single dimension is selected first, then the various combinations of two dimensions, then the various combinations of three dimensions, until all dimensions are selected.
For example, following the example of fig. 3, in each of steps 330, 340, 350, and 360 the dimensions are selected in descending order of their slice counts. In step 330, the dimension with the largest slice count is selected first to determine whether the resulting number of tensor blocks is divisible by the number of available parallel computing units, then the dimension with the second largest slice count, and so on. In step 340, the combination of the two dimensions with the largest slice counts (i.e., the dimensions with the largest and second largest slice counts) is selected first, and so on. In step 350, the combination of the three dimensions with the largest slice counts (i.e., the dimensions with the largest, second largest, and third largest slice counts) is selected first, and so on. In this way, the best splitting strategy can be found faster so that the parallel computing units are utilized more fully. A sketch of this search order is given below.
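As an illustration only (an assumed sketch, not the patent's implementation), the search order of Fig. 3 combined with the descending slice-count ordering might look like this:

```python
from itertools import combinations
from math import prod

def find_best_split(shape, granularity, num_units):
    """Try single dimensions, then pairs, then triples, ...; within each size prefer
    dimensions with larger slice counts; stop at the first divisible block count,
    otherwise fall back to the combination with the largest remainder."""
    dims = sorted(shape, key=lambda d: shape[d] // granularity[d], reverse=True)
    fallback, best_remainder = None, -1
    for k in range(1, len(dims) + 1):
        for combo in combinations(dims, k):
            blocks = prod(shape[d] // granularity[d] for d in combo)
            if blocks % num_units == 0:
                return combo, blocks                      # divisible: best splitting strategy
            if blocks % num_units > best_remainder:
                fallback, best_remainder = (combo, blocks), blocks % num_units
    return fallback                                       # step 370: largest-remainder fallback

shape = {"N": 4, "H": 4096, "W": 8192}
granularity = {"N": 1, "H": 64, "W": 64}
print(find_best_split(shape, granularity, 32))  # (('W',), 128): width alone divides evenly over 32 units
print(find_best_split(shape, granularity, 36))  # (('H',), 64): falls back to the largest remainder, 28
```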
In this way, a tensor dimension splitting method is obtained that supports automatically formulating a multi-computing-unit parallel strategy for splitting any dimension: one or more dimensions to split can be inferred automatically for different tensor shapes, layouts (including dimensions and sizes), and numbers of computing units, so that the tensor is split along those dimensions and delivered to the plurality of computing units for parallel processing, and the whole strategy can be handled automatically by the automated handler manager without the user being aware of it.
After the best splitting strategy of one or more dimensions has been selected for a tensor, the operator operation may also require that the splitting of the tensor matches the splitting of the other tensors involved. Therefore, in one embodiment, the tensor dimension splitting method may further include: after determining that splitting the selected one or more dimensions is the best splitting strategy, applying the best splitting strategy, according to the operator operation performed on the tensor, also to the corresponding one or more dimensions of the output tensor and of the other tensor that takes part in the operator operation with the tensor. In this way, the best splitting strategy of one or more dimensions of the tensor is matched with the splitting of the corresponding dimensions of the other input tensor and of the output tensor, so that the operator operation on the split tensors can be performed correctly.
Depending on the operator operation, the splitting strategy may need to be matched between the output tensor and the other tensor involved in the operator operation. Some operator operation scenarios are given below as examples to show which tensors the best splitting strategy is also applied to.
Fig. 4A and 4B each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is convolution forward propagation (Convolution Forward, Conv FWD).
As shown in fig. 4A, for example, when the batch dimension is chosen to be split for the tensor x, the result tensor y(x) of the first Conv_FWD operation is also split along the batch dimension, and the result tensor y of the second Conv_FWD operation is also split along the batch dimension.
As shown in fig. 4B, for example, when the output channel (OC) dimension is chosen to be split for the weight tensor weight, the result tensor y(x) of the first Conv_FWD operation is split along its input channel (IC) dimension. Because y(x) is split along the IC dimension, when the second Conv_FWD operator operation is performed, the weight tensor that takes part in the operator operation with y(x) must also split the corresponding dimension, i.e., its OC dimension, and the result tensor y of the second Conv_FWD operation is then split along its IC dimension.
Fig. 5A and 5B each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is convolution backward propagation of the activation (Convolution Backward Propagation Activation, Conv BPA).
As shown in fig. 5A, for example, when the batch dimension is chosen to be split for the tensor dy, the result tensor x(dy) of the first Conv_BPA operation is also split along the batch dimension, and the result tensor x of the second Conv_BPA operation is also split along the batch dimension.
As shown in fig. 5B, for example, when the output channel (OC) dimension is chosen to be split for the tensor dy and the OC dimension is also split for the weight tensor weight, the result tensor x(dy) of the first Conv_BPA operation is split along the input channel (IC) dimension. Because x(dy) is split along the IC dimension, when the second Conv_BPA operator operation is performed, the weight tensor that takes part in the operator operation with x(dy) must also split the corresponding dimension, i.e., its OC dimension, so that the correct result is obtained after the second Conv_BPA operation.
Fig. 6A, 6B, and 6C each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is convolution backward propagation of the weight (Convolution Backward Propagation Weight, Conv BPW).
As shown in fig. 6A, for the two inputs of Conv_BPW (tensor x and tensor dy), if the batch dimension is chosen to be split for tensor x, the batch dimension is also split for tensor dy. As shown in fig. 6B, if the output channel (OC) dimension is split for tensor dy, the OC dimension is also split for the output weight of the Conv_BPW operator applied to x and dy. As shown in fig. 6C, if the input channel (IC) dimension is split for tensor x, the IC dimension is also split for the output weight of the Conv_BPW operator applied to x and dy.
Fig. 7A and 7B each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is matrix multiply-add (Matrix multiply and accumulate, MMA) without transpose.
As shown in fig. 7A, for the two input tensors A and B of the MMA, if the dimension M is chosen to be split for tensor A, the result tensor C produced by the MMA operation is also split along the dimension M.
As shown in fig. 7B, for the two input tensors A and B of the MMA, if the dimension N is chosen to be split for tensor B, the result tensor C produced by the MMA operation is also split along the dimension N.
Fig. 8A and 8B each show an example of matching the splitting strategy to the other tensor involved in the operator operation and to the output tensor, in the case where the operator operation is MMA with transpose.
As shown in fig. 8A, when the tensor A is transposed, if the dimension M is chosen to be split for tensor A, then although it is the 5th dimension of tensor A, the result tensor C produced by the MMA operation is also split along the dimension M, which is the 4th dimension of tensor C.
As shown in fig. 8B, when the tensor B is transposed, if the dimension N is chosen to be split for tensor B, then although it is the 4th dimension of tensor B, the result tensor C produced by the MMA operation is also split along the dimension N, which is the 5th dimension of tensor C.
Fig. 9A and 9B each show an example of matching the splitting strategy of the input tensor to the output tensor in the case where the operator operation is a fusion operator of MMA plus bias (BIAS).
As shown in fig. 9A, if the matrix tensor output by the MMA operator (e.g., the matrix tensor of [2,256,112]) splits its W dimension (i.e., the dimension of size 112), then the linear tensor to be added by the ADD operator needs to be split if it is a column vector, and does not need to be split if it is a row vector.
As shown in fig. 9B, if the matrix tensor output by the MMA operator (e.g., the matrix tensor of [2,256,112]) splits its H dimension (i.e., the dimension of size 256), then the linear tensor to be added by the ADD operator does not need to be split if it is a column vector, and needs to be split if it is a row vector.
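To illustrate how such matching rules could be looked up mechanically, here is a purely hypothetical sketch of a rule table built from a few of the figures above; the table and function are illustrative assumptions, not the patent's actual data structures:

```python
# Maps (operator, input tensor, split dimension) to the dimensions that must also be
# split on the other tensors so that the operator still computes a correct result.
PROPAGATION_RULES = {
    ("Conv_FWD", "x", "batch"):   {"y": "batch"},   # Fig. 4A
    ("Conv_FWD", "weight", "OC"): {"y": "IC"},      # Fig. 4B
    ("Conv_BPA", "dy", "batch"):  {"x": "batch"},   # Fig. 5A
    ("Conv_BPW", "x", "batch"):   {"dy": "batch"},  # Fig. 6A
    ("MMA", "A", "M"):            {"C": "M"},       # Fig. 7A
    ("MMA", "B", "N"):            {"C": "N"},       # Fig. 7B
}

def propagate_split(op: str, tensor: str, split_dim: str) -> dict:
    """Return the dimensions to split on the other tensors, or {} if no rule applies."""
    return PROPAGATION_RULES.get((op, tensor, split_dim), {})

print(propagate_split("Conv_FWD", "x", "batch"))  # {'y': 'batch'}
```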
The above examples are all splitting strategies set so that the split tensors can smoothly undergo the subsequent operator operations; the splitting strategies are not limited to these, and corresponding splitting strategies may be set for other operator operations and fusion operator operations so that the split tensors can smoothly undergo the subsequent operator operations.
Therefore, on the one hand, according to the operator operation scenario, it can be ensured that all tensors involved in the operator operation are split according to the same rule; on the other hand, the shape passed to the next-layer operator is the shape of the tensor after splitting, so the next-layer operator requires little adaptation.
In summary, the embodiments of the present disclosure provide a tensor dimension splitting method that supports automatically formulating a multi-computing-unit parallel strategy for splitting any dimension. The method can automatically infer one or more dimensions to split for different tensor shapes, layouts (including dimensions and sizes), and numbers of computing units, so that the tensor is split along those dimensions and delivered to the plurality of computing units for parallel processing, and the whole strategy can be handled automatically by the automated handler manager without the user being aware of it. Moreover, according to the operator operation scenario, it can be ensured that all tensors involved in the operator operation are split according to the same rule, and the shape passed to the next-layer operator is the shape of the tensor after splitting, so the next-layer operator requires little adaptation.
Fig. 10 shows a block diagram of a tensor dimension splitting system 1000 according to an embodiment of the present application.
As shown in fig. 10, the tensor dimension splitting system 1000 includes: a splitting device 1010 configured to split the selected one or more dimensions according to the splitting granularity of each dimension of the tensor to obtain the number of tensor blocks after the selected one or more dimensions are split; and a determining device 1020 configured to determine whether splitting the selected one or more dimensions is the best splitting strategy based on the number of tensor blocks and the number of the plurality of available parallel computing units.
In this way, based on the number of tensor blocks obtained after the dimensions of the tensor are split and the number of available computing units, it can be determined more accurately whether splitting the selected dimension or dimensions is the best splitting strategy, so that the available computing units are used more fully to perform the operator operations of the tensor blocks in parallel.
In one embodiment, the determining device 1020 is configured to: determine, based on the number of tensor blocks, that splitting the selected one or more dimensions is the best splitting strategy if that number of tensor blocks maximizes the utilization of the available parallel computing units; and perform, with the available parallel computing units, the operator operations of the plurality of tensor blocks obtained after the tensor is split according to the best splitting strategy.
In one embodiment, the determining device 1020 is configured to: determine that splitting the selected one or more dimensions is the best splitting strategy in response to the number of tensor blocks being divisible by the number of available parallel computing units; in response to the number of tensor blocks not being divisible by the number of available parallel computing units, select one or more other dimensions and determine whether the number of tensor blocks after the selected other dimension or dimensions are split is divisible by the number of available parallel computing units; determine that splitting the selected other dimension or dimensions is the best splitting strategy in response to the number of tensor blocks after they are split being divisible by the number of available parallel computing units; and in response to no selection of one or more of the dimensions making the number of tensor blocks divisible by the number of available parallel computing units, determine that splitting the one or more dimensions that maximize the remainder of the number of tensor blocks divided by the number of available parallel computing units, or that maximize the number of parallel units actually needed for the tensor blocks, is the best splitting strategy.
In an embodiment, the determining device 1020 is configured to perform, with the available parallel computing units, the operator operations of the plurality of tensor blocks obtained after the tensor is split according to the best splitting strategy by: in the case where splitting the one or more dimensions with the largest remainder of the number of tensor blocks divided by the number of available parallel computing units is determined to be the best splitting strategy, having each available parallel unit perform in parallel the operator operations of a number of tensor blocks equal to the quotient of the number of tensor blocks divided by the number of available parallel computing units; or having each available parallel unit perform in parallel the operator operations of that quotient number of tensor blocks, and then having a number of available parallel units equal to the remainder of the number of tensor blocks divided by the number of available parallel computing units perform in parallel the operator operations of the remaining tensor blocks.
In an embodiment, the determining device 1020 is configured to perform, with the available parallel computing units, the operator operations of the plurality of tensor blocks obtained after the tensor is split according to the best splitting strategy by: in the case where splitting the one or more dimensions that maximize the number of parallel units needed for the tensor blocks is determined to be the best splitting strategy, calculating the number of needed parallel computing units from the number of tensor blocks using the following formula: number of parallel computing units needed = ceil(number of tensor blocks / ceil(number of tensor blocks / number of available parallel computing units)), where ceil denotes rounding up; determining that splitting the one or more dimensions whose number of tensor blocks maximizes the number of needed parallel computing units is the best splitting strategy; and performing in parallel, with that largest number of parallel computing units, the operator operations of the plurality of tensor blocks obtained after the tensor is split according to the best splitting strategy.
In one embodiment, the selected one or more dimensions are selected in one of the following ways: in ascending order of the number of dimensions, each single dimension is selected first, then the various combinations of two dimensions, and then the various combinations of three dimensions, until all dimensions are selected; or, in ascending order of the number of dimensions and, within each size, in descending order of the slice counts of the dimensions, each single dimension is selected first, then the various combinations of two dimensions, and then the various combinations of three dimensions, until all dimensions are selected; or the user selects one or more dimensions as the selected one or more dimensions; or one or more of the dimensions are selected at random as the selected one or more dimensions.
In one embodiment, the splitting device 1010 is configured to: divide the size of each of the selected one or more dimensions by the splitting granularity of that dimension to obtain the slice count of that dimension; and multiply the slice counts of all of the selected one or more dimensions to obtain the number of tensor blocks after the one or more dimensions are split.
In one embodiment, the system 1000 further includes: an application device (not shown in the figure) configured to, after it is determined that splitting the selected one or more dimensions is the best splitting strategy, apply the best splitting strategy, according to the operator operation performed on the tensor, also to the corresponding one or more dimensions of the output tensor and of the other tensor that takes part in the operator operation.
In summary, the embodiments of the present disclosure provide a tensor dimension splitting method that supports automatically formulating a multi-computing-unit parallel strategy for splitting any dimension. The method can automatically infer one or more dimensions to split for different tensor shapes, layouts (including dimensions and sizes), and numbers of computing units, so that the tensor is split along those dimensions and delivered to the plurality of computing units for parallel processing, and the whole strategy can be handled automatically by the automated handler manager without the user being aware of it. Moreover, according to the operator operation scenario, it can be ensured that all tensors involved in the operator operation are split according to the same rule, and the shape passed to the next-layer operator is the shape of the tensor after splitting, so the next-layer operator requires little adaptation.
FIG. 11 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application.
The electronic device may include a processor (H1) and a storage medium (H2) coupled to the processor (H1), in which computer-executable instructions are stored that, when executed by the processor, perform the steps of the methods of the embodiments of the present application.
The processor (H1) may include, for example but without limitation, one or more processors, microprocessors, or the like.
The storage medium (H2) may include, for example but without limitation, random access memory (RAM), read-only memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, or a computer storage medium (e.g., a hard disk, a floppy disk, a solid state disk, a removable disk, a CD-ROM, a DVD-ROM, a Blu-ray disc, etc.).
In addition, the electronic device may include, but is not limited to, a data bus (H3), an input/output (I/O) bus (H4), a display (H5), and an input/output device (H6) (e.g., keyboard, mouse, speaker, etc.), among others.
The processor (H1) may communicate with external devices (H5, H6, etc.) via a wired or wireless network (not shown) through an I/O bus (H4).
The storage medium (H2) may also store at least one computer-executable instruction that, when executed by the processor (H1), performs the functions and/or steps of the methods in the embodiments described in the present technology.
In one embodiment, the at least one computer-executable instruction may also be compiled or otherwise formed into a software product in which one or more computer-executable instructions, when executed by a processor, perform the functions and/or steps of the methods described in the embodiments of the technology.
FIG. 12 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the present application.
As shown in fig. 12, the non-transitory computer-readable storage medium 1220 has instructions stored thereon, such as computer-readable instructions 1210. When executed by a processor, the computer-readable instructions 1210 may perform the various methods described above. Non-transitory computer-readable storage media include, for example but without limitation, volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, etc. For example, the non-transitory computer-readable storage medium 1220 may be connected to a computing device such as a computer, and the computing device may then run the computer-readable instructions 1210 stored on the computer-readable storage medium 1220 to perform the various methods described above.
Of course, the specific embodiments described above are merely examples and are not limiting. In accordance with the concepts of the present application, those skilled in the art may combine and recombine steps and means from the separately described embodiments to achieve the effects of the present application; such combined embodiments are also included in the present application and are not described here one by one.
Note that the advantages, effects, and the like mentioned in the present disclosure are merely examples and are not to be construed as essential to the various embodiments of the present application. Furthermore, the specific details disclosed above are provided for purposes of illustration and ease of understanding only, and are not intended to limit the application to those details.
The block diagrams of the devices, apparatuses, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended terms meaning "including but not limited to" and may be used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Moreover, those of skill would appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an ASIC, a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
In one or more example aspects, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software (e.g., a computer program product comprising computer instructions), the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Further, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
While the foregoing disclosure shows illustrative aspects of the disclosure, it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the aspects of the disclosure described herein need not be performed in any particular order. Furthermore, although elements of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

Claims (16)

1. A tensor dimension splitting method, comprising:
splitting one or more selected dimensions according to a splitting granularity of each dimension of the tensor to obtain a tensor block number after the selected one or more dimensions are split; and
determining, according to the tensor block number and the number of a plurality of available parallel computing units, whether splitting the selected one or more dimensions is an optimal splitting strategy.
2. The method of claim 1, wherein determining, according to the tensor block number and the number of the plurality of available parallel computing units, whether splitting the selected one or more dimensions is the optimal splitting strategy comprises:
determining, according to the tensor block number, that splitting the selected one or more dimensions is the optimal splitting strategy if the tensor block number can maximize utilization of the plurality of available parallel computing units; and
performing, with the available parallel computing units, operator operations on a plurality of tensor blocks obtained after the tensor is split according to the optimal splitting strategy.
3. The method of claim 2, wherein determining, according to the tensor block number, that splitting the selected one or more dimensions is the optimal splitting strategy if the tensor block number can maximize utilization of the plurality of available parallel computing units comprises:
determining that splitting the selected one or more dimensions is the optimal splitting strategy in response to the tensor block number being divisible by the number of available parallel computing units.
4. The method of claim 2, wherein determining, according to the tensor block number, that splitting the selected one or more dimensions is the optimal splitting strategy if the tensor block number can maximize utilization of the plurality of available parallel computing units comprises:
in response to the tensor block number not being divisible by the number of available parallel computing units, selecting one or more other dimensions and determining whether the tensor block number after the selected one or more other dimensions are split is divisible by the number of available parallel computing units.
5. The method of claim 4, wherein determining, according to the tensor block number, that splitting the selected one or more dimensions is the optimal splitting strategy if the tensor block number can maximize utilization of the plurality of available parallel computing units comprises:
determining that splitting the selected one or more other dimensions is the optimal splitting strategy in response to the tensor block number after the selected one or more other dimensions are split being divisible by the number of available parallel computing units.
6. The method of claim 2, wherein determining, according to the tensor block number, that splitting the selected one or more dimensions is the optimal splitting strategy if the tensor block number can maximize utilization of the plurality of available parallel computing units comprises:
in response to no selection of one or more of the dimensions making the tensor block number divisible by the number of available parallel computing units, determining that splitting the one or more dimensions that maximizes the remainder of dividing the tensor block number by the number of available parallel computing units, or that maximizes the number of available parallel computing units required for the tensor block number, is the optimal splitting strategy.
7. The method according to claim 3 or 5, wherein performing, with the available parallel computing units, operator operations on a plurality of tensor blocks obtained after the tensor is split according to the optimal splitting strategy comprises:
performing, in parallel with each available parallel computing unit, operator operations on a number of tensor blocks equal to the quotient of the tensor block number divided by the number of available parallel computing units.
8. The method of claim 6, wherein performing, with the available parallel computing units, operator operations on a plurality of tensor blocks obtained after the tensor is split according to the optimal splitting strategy comprises:
in the case where splitting the one or more dimensions that maximizes the remainder of dividing the tensor block number by the number of available parallel computing units is determined to be the optimal splitting strategy, performing, in parallel with each available parallel computing unit, operator operations on a number of tensor blocks equal to the quotient of the tensor block number divided by the number of available parallel computing units, and performing, in parallel with a number of available parallel computing units equal to the remainder of the tensor block number divided by the number of available parallel computing units, operator operations on the remaining tensor blocks.
9. The method of claim 6, wherein performing, with the available parallel computing units, operator operations on a plurality of tensor blocks obtained after the tensor is split according to the optimal splitting strategy comprises:
in the case where splitting the one or more dimensions that maximizes the number of available parallel computing units required for the tensor block number is determined to be the optimal splitting strategy, calculating the number of parallel computing units required from the tensor block number by means of the following formula:
number of parallel computing units required = ceil(tensor block number / number of available parallel computing units), where ceil denotes rounding up;
determining that splitting the one or more dimensions whose tensor block number maximizes the number of required parallel computing units is the optimal splitting strategy; and
performing, in parallel with the maximum number of parallel computing units, operator operations on the plurality of tensor blocks obtained after the tensor is split according to the optimal splitting strategy.
10. The method of claim 1, wherein the selected one or more dimensions are selected by:
in order of increasing number of dimensions, selecting each single dimension, then selecting the various combinations of two of the dimensions, and then selecting the various combinations of three of the dimensions, until all of the dimensions are selected.
11. The method of claim 1, wherein the selected one or more dimensions are selected by:
in order of increasing number of dimensions and in order of decreasing number of dimension partitions after each dimension is split, selecting each single dimension, then selecting the various combinations of two of the dimensions, and then selecting the various combinations of three of the dimensions, until all of the dimensions are selected.
12. The method of claim 1, wherein splitting the selected one or more dimensions according to the splitting granularity of each dimension of the tensor to obtain the tensor block number after the selected one or more dimensions are split comprises:
dividing the size of each of the selected one or more dimensions by the splitting granularity of that dimension to obtain a dimension partition number of that dimension; and
multiplying the respective dimension partition numbers of all of the selected one or more dimensions to obtain the tensor block number after the selected one or more dimensions are split.
13. The method of claim 1, wherein the method further comprises:
after determining that splitting the selected one or more dimensions is the optimal splitting strategy, applying, according to the operator operation of the tensor, the optimal splitting strategy to the corresponding one or more dimensions of another tensor with which the operator operation is performed on the tensor and of the output tensor of the operator operation.
14. A tensor dimension splitting system, comprising:
a splitting device configured to split the selected one or more dimensions according to the splitting granularity of each dimension of the tensor to obtain the tensor block number after the selected one or more dimensions are split; and
a determining device configured to determine, according to the tensor block number and the number of a plurality of available parallel computing units, whether splitting the selected one or more dimensions is the optimal splitting strategy.
15. An electronic device, comprising:
a memory for storing instructions;
a processor for reading instructions in said memory and performing the method of any of claims 1-13.
16. A non-transitory storage medium having instructions stored thereon,
wherein the instructions, when read by a processor, cause the processor to perform the method of any of claims 1-13.
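As a further non-limiting illustration of the arithmetic recited in claims 7, 8 and 12 above, the short Python sketch below derives the per-dimension partition numbers from the splitting granularities, the resulting tensor block number, and the quotient and remainder that govern how tensor blocks are distributed over the available parallel computing units; the concrete sizes, granularities and unit count are assumptions chosen only for exposition.

from math import prod

sizes = [8, 128]           # sizes of the two selected dimensions (illustrative values)
granularity = [2, 32]      # splitting granularity of those dimensions
dim_cuts = [s // g for s, g in zip(sizes, granularity)]   # per-dimension partition numbers: [4, 4]
tensor_blocks = prod(dim_cuts)                            # tensor block number: 16

available_units = 6
quotient, remainder = divmod(tensor_blocks, available_units)   # 2 blocks per unit, 4 left over
# Each of the 6 units processes `quotient` (= 2) blocks in parallel; the remaining
# `remainder` (= 4) blocks occupy only 4 of the 6 units in a final round, which is
# why the method prefers splits whose remainder is either zero or as large as possible.
print(dim_cuts, tensor_blocks, quotient, remainder)            # [4, 4] 16 2 4

Under these assumed numbers the final round leaves two units idle; the selection logic sketched earlier would therefore keep searching for a dimension subset whose tensor block number is divisible by six, or failing that, one with a larger remainder.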
CN202410100952.4A 2024-01-25 2024-01-25 Tensor dimension segmentation method, system, device and medium Active CN117634711B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410100952.4A CN117634711B (en) 2024-01-25 2024-01-25 Tensor dimension segmentation method, system, device and medium

Publications (2)

Publication Number Publication Date
CN117634711A true CN117634711A (en) 2024-03-01
CN117634711B CN117634711B (en) 2024-05-14

Family

ID=90025496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410100952.4A Active CN117634711B (en) 2024-01-25 2024-01-25 Tensor dimension segmentation method, system, device and medium

Country Status (1)

Country Link
CN (1) CN117634711B (en)

Patent Citations (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6233365B1 (en) * 1996-05-27 2001-05-15 Sharp Kabushiki Kaisha Image-processing method
US6810500B1 (en) * 1999-12-09 2004-10-26 Acer Laboratories Inc. Method for mapping a two-dimensional data array in a memory
TW200820036A (en) * 2006-10-27 2008-05-01 Mitac Int Corp Image identification, authorization and security method of a handheld mobile device
CN102819582A (en) * 2012-07-26 2012-12-12 华数传媒网络有限公司 Quick searching method for mass images
CN104636495A (en) * 2015-03-05 2015-05-20 四川智羽软件有限公司 Method for retrieving video on basis of contents
CN107239825A (en) * 2016-08-22 2017-10-10 北京深鉴智能科技有限公司 Consider the deep neural network compression method of load balancing
CN110084364A (en) * 2018-01-25 2019-08-02 北京深鉴智能科技有限公司 A kind of deep neural network compression method and device
CN109886398A (en) * 2019-01-03 2019-06-14 曾集伟 Neural network matrix multiplying method and Related product
CN110209503A (en) * 2019-08-01 2019-09-06 上海燧原智能科技有限公司 Specification calculation method, device, equipment and the medium of multidimensional tensor
CN114270319A (en) * 2019-10-07 2022-04-01 谷歌有限责任公司 Reallocating tensor elements among machine learning computation units
CN110782009A (en) * 2019-10-17 2020-02-11 湖南大学 Computing kernel optimization method based on ARMv8 system
WO2021179117A1 (en) * 2020-03-09 2021-09-16 华为技术有限公司 Method and apparatus for searching number of neural network channels
CN113449859A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Data processing method and device
CN113994350A (en) * 2020-03-27 2022-01-28 华为技术有限公司 Generating parallel computing schemes for neural networks
CN112507173A (en) * 2020-12-15 2021-03-16 无锡灵汐类脑科技有限公司 Tensor segmentation method, device, chip and medium
CN113268708A (en) * 2021-07-16 2021-08-17 北京壁仞科技开发有限公司 Method and device for matrix calculation
CN113485845A (en) * 2021-08-02 2021-10-08 王凯涛 Multithreading artificial intelligence resource allocation method and device
CN113407353A (en) * 2021-08-18 2021-09-17 北京壁仞科技开发有限公司 Method and device for using graphics processor resources and electronic equipment
CN115878636A (en) * 2021-09-28 2023-03-31 北京转转精神科技有限责任公司 Incremental data processing method and device in vector retrieval system and server
CN113850796A (en) * 2021-10-12 2021-12-28 Oppo广东移动通信有限公司 Lung disease identification method and device based on CT data, medium and electronic equipment
CN113641952A (en) * 2021-10-14 2021-11-12 北京壁仞科技开发有限公司 Convolution device, convolution method, matrix disaggregation device and matrix disaggregation method
CN116029187A (en) * 2021-10-25 2023-04-28 太初(无锡)电子科技有限公司 Deep learning model parallel strategy space representation method
CN114186633A (en) * 2021-12-10 2022-03-15 北京百度网讯科技有限公司 Distributed training method, device, equipment and storage medium of model
CN114611675A (en) * 2022-03-22 2022-06-10 浙江大学 Data processing method, data processing device, electronic device and storage medium
CN114996649A (en) * 2022-05-09 2022-09-02 深圳市国微电子有限公司 Method for realizing matrix decomposition and lower triangular matrix inversion
US20230140173A1 (en) * 2022-08-19 2023-05-04 Arnab Raha Deep neural network (dnn) accelerators with heterogeneous tiling
CN115858648A (en) * 2022-11-29 2023-03-28 上海燧原科技有限公司 Database generation method, data stream segmentation method, device, equipment and medium
CN116991560A (en) * 2023-09-25 2023-11-03 粤港澳大湾区数字经济研究院(福田) Parallel scheduling method, device, equipment and storage medium for language model
CN117271136A (en) * 2023-10-20 2023-12-22 上海壁仞科技股份有限公司 Data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN117634711B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
US11531727B1 (en) Computation of neural network node with large input values
CN109657782B (en) Operation method, device and related product
US11915119B2 (en) Convolutional neural network (CNN) processing method and apparatus performing high speed and precision convolution operations
US9298760B1 (en) Method for shard assignment in a large-scale data processing job
EP3179415A1 (en) Systems and methods for a multi-core optimized recurrent neural network
US20220083857A1 (en) Convolutional neural network operation method and device
JP2019061664A (en) Method and device for adapting parameter of neural network
US20210295168A1 (en) Gradient compression for distributed training
US20210304008A1 (en) Speculative training using partial gradients update
CN114841327A (en) Processing method and device of computation graph, readable medium and electronic equipment
US11188796B2 (en) Method and apparatus with data processing
US20230196113A1 (en) Neural network training under memory restraint
CN113485836B (en) Tensor processing method and tensor processing system based on tensor segmentation
JP2017123010A (en) Semiconductor design support device and semiconductor design support method
CN117634711B (en) Tensor dimension segmentation method, system, device and medium
US11003448B2 (en) DSP slice configured to forward operands to associated DSP slices
KR101772333B1 (en) INTELLIGENT JOIN TECHNIQUE PROVIDING METHOD AND SYSTEM BETWEEN HETEROGENEOUS NoSQL DATABASES
US11726746B1 (en) Vector operation acceleration with convolution computation unit
CN111027688A (en) Neural network calculator generation method and device based on FPGA
CN110955380B (en) Access data generation method, storage medium, computer device and apparatus
US11681498B2 (en) Neural network arithmetic processing device and neural network arithmetic processing method
US20200057638A1 (en) Linear feedback shift register for a reconfigurable logic unit
CN110929623A (en) Multimedia file identification method, device, server and storage medium
EP4300369A1 (en) Methods and systems for executing a neural network on a neural network accelerator
KR101669356B1 (en) Mapreduce method for triangle enumeration and apparatus thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant