CN117632085A - Method, computing device, and storage medium for mask tensor conversion - Google Patents

Method, computing device, and storage medium for mask tensor conversion

Info

Publication number
CN117632085A
CN117632085A (application CN202410101362.3A)
Authority
CN
China
Prior art keywords
data
tensor
mask tensor
dimension
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410101362.3A
Other languages
Chinese (zh)
Other versions
CN117632085B (en)
Inventor
Name withheld upon request (请求不公布姓名)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410101362.3A priority Critical patent/CN117632085B/en
Publication of CN117632085A publication Critical patent/CN117632085A/en
Application granted granted Critical
Publication of CN117632085B publication Critical patent/CN117632085B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of the present invention relate to a method, computing device, and storage medium for mask tensor conversion. The method comprises the following steps: converting the original mask tensor into a first intermediate mask tensor based on a predetermined conversion rule, wherein data in the first intermediate mask tensor is bit-type data; grouping data of the first intermediate mask tensor in a first dimension to obtain a plurality of initial data blocks; and rearranging data in the plurality of initial data blocks to obtain a second intermediate mask tensor, wherein the second intermediate mask tensor comprises a plurality of rearranged data blocks, and each rearranged data block comprises a portion of the data in each initial data block. Embodiments of the present invention can compress the size of the original mask tensor, reduce the pressure on board bandwidth during data computation, avoid data stalls in the data loading stage, and greatly improve the execution performance of data computation on the board.

Description

Method, computing device, and storage medium for mask tensor conversion
Technical Field
Embodiments of the present invention relate generally to the field of data computation and, more particularly, relate to a method, computing device, and storage medium for mask tensor conversion.
Background
In training a Transformer model (such as a language model or a text-to-image model), data computation based on the masked_fill function needs to be performed many times as part of the attention mechanism. The existing method for performing data computation based on the masked_fill function comprises the following steps: receiving an input tensor, a mask tensor, and a fill value; and determining an output tensor via the masked_fill function based on the input tensor, the mask tensor, and the fill value. However, in the above prior art, receiving and loading the input tensor and the mask tensor and outputting and storing the output tensor account for a large share of the time spent in the data computation, while determining the values of the elements in the output tensor accounts for only a small share, so that severe pressure is placed on the bandwidth of the board that performs the data computation, resulting in poor data computation performance.
In summary, the conventional method for performing data computation based on the masked_fill function places serious stress on the bandwidth of the board performing the data computation, so that the performance of the data computation is poor.
Disclosure of Invention
In view of the above problems, the present invention provides a method, a computing device, and a storage medium for mask tensor conversion, which enable the mask tensor used in data computation to be compressed in size, thereby reducing the pressure on board bandwidth during data computation and improving the execution performance of the data computation.
According to a first aspect of the present invention, there is provided a method for mask tensor conversion, comprising: converting the original mask tensor into a first intermediate mask tensor based on a predetermined conversion rule, wherein data in the first intermediate mask tensor is bit-type data; grouping data of the first intermediate mask tensor in a first dimension to obtain a plurality of initial data blocks; and rearranging data in the plurality of initial data blocks to obtain a second intermediate mask tensor, wherein the second intermediate mask tensor comprises a plurality of rearranged data blocks, and each rearranged data block comprises a portion of the data in each initial data block.
In some embodiments, the data in the original mask tensor is Boolean data. In these embodiments, converting the original mask tensor to the first intermediate mask tensor based on the predetermined conversion rule includes: converting the Boolean data with the value of true in the original mask tensor into bit-type data with a first value; and converting the Boolean data with the value of false in the original mask tensor into bit-type data with a second value.
In some embodiments, grouping the data of the first intermediate mask tensor in the first dimension to obtain the plurality of initial data blocks includes: determining a data length for grouping; and grouping the data of the first intermediate mask tensor in the first dimension based on the determined data length, so that the plurality of initial data blocks obtained after grouping are of the same size.
In some embodiments, the data length used for grouping is related to the type of data in the result mask tensor to be obtained.
In some embodiments, grouping the data of the first intermediate mask tensor in the first dimension to obtain the plurality of initial data blocks further comprises: in response to the data length of the first intermediate mask tensor in the first dimension not being divisible by the data length used for grouping, padding the data of the first intermediate mask tensor in the first dimension such that the data length of the padded first intermediate mask tensor in the first dimension is divisible by the data length used for grouping.
In some embodiments, rearranging the data in the plurality of initial data blocks to obtain the second intermediate mask tensor comprises: determining, for each initial data block, a first index value of the initial data block in a first dimension and a second index value in a second dimension; determining, for each data in the initial data block, a third index value of the data in the initial data block; and determining a position of the data in the second intermediate mask tensor based at least on the first index value and the second index value of the initial data block and a third index value of the data in the initial data block, so as to move the data to a corresponding position of a corresponding rearranged data block.
In some embodiments, determining the location of the data in the second intermediate mask tensor comprises: determining a rearranged data block corresponding to the data based on a third index value of the data in the initial data block and a second index value of the initial data block; and determining a location of the data in the determined rearranged data block based on the first index value and the second index value of the initial data block.
In some embodiments, the method for mask tensor conversion further comprises: compressing the second intermediate mask tensor such that the data in the second intermediate mask tensor is converted into integer-type data or floating-point-type data to obtain a result mask tensor.
According to a second aspect of the present invention, there is provided a method for data computation, comprising: receiving input parameters, the input parameters comprising: an input tensor, an original mask tensor, and a fill value; converting the original mask tensor according to the method of the first aspect of the invention to obtain a result mask tensor; and determining an output tensor based on the input tensor, the result mask tensor, and the fill value.
According to a third aspect of the present invention there is provided a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the invention.
According to a fourth aspect of the present invention there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect of the present invention.
According to a fifth aspect of the present invention there is provided a computer program product, wherein the computer program product is tangibly stored on a non-transitory computer-readable medium and comprises machine executable instructions which, when executed, cause a machine to perform the steps in the method of the first aspect of the present invention.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
The above and other features, advantages and aspects of embodiments of the present invention will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.
Fig. 1 shows a schematic diagram of a system for mask tensor conversion according to an embodiment of the invention.
Fig. 2 shows a flow chart of a method for mask tensor conversion according to an embodiment of the invention.
Fig. 3a shows a schematic diagram of a first intermediate mask tensor according to an embodiment of the invention.
Fig. 3b shows a schematic diagram of the principle of rearranging data in a first intermediate mask tensor to obtain a second intermediate mask tensor according to an embodiment of the invention.
Fig. 4 shows a schematic diagram of the principle of compressing the second intermediate mask tensor to obtain a resulting mask tensor according to an embodiment of the invention.
FIG. 5 shows a schematic diagram of a system for data computation according to an embodiment of the invention.
FIG. 6 shows a flow chart of a method for data computation according to an embodiment of the invention.
Fig. 7 shows an exemplary schematic diagram of converting an original mask tensor into a resulting mask tensor according to an embodiment of the invention.
FIG. 8 schematically illustrates a block diagram of a computing device suitable for use in implementing embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment. The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
In training a Transformer model, such as a language model or a text-to-image model, based on the attention mechanism, the masked_fill function needs to be used multiple times to replace at least some elements in the input tensor.
Typically, the input parameters of the masked_fill function include: an input tensor, a mask (mask) tensor, and a fill value. The mask tensor is a Boolean tensor of the same shape and size as the input tensor, and is used to indicate the locations of the elements to be replaced in the input tensor; the fill value may be a scalar indicating the replacement element value. The existing scheme for performing data computation based on the masked_fill function comprises the following steps: receiving input parameters, wherein the input parameters comprise an input tensor, a mask tensor, and a fill value; and determining an output tensor via the masked_fill function based on the input tensor, the mask tensor, and the fill value. For example, if the Boolean value of an element in the mask tensor is "True", the element of the output tensor at the corresponding index position takes the fill value; if the Boolean value of an element in the mask tensor is "False", the element of the output tensor at the corresponding index position takes the value of the corresponding element in the input tensor.
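For orientation, the following is a minimal PyTorch sketch of this existing scheme; the tensor shapes, values, and the fill value of -1e9 are illustrative assumptions only, not taken from the patent:

    import torch

    input_tensor = torch.tensor([[1.0, 2.0, 3.0],
                                 [4.0, 5.0, 6.0]])
    # Boolean mask of the same shape as the input tensor.
    mask = torch.tensor([[True, False, True],
                         [False, True, False]])
    fill_value = -1e9  # e.g. used to suppress attention scores

    # Where mask is True the output takes the fill value;
    # elsewhere it keeps the corresponding input element.
    output = input_tensor.masked_fill(mask, fill_value)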
However, in the above scheme of performing data computation based on the masked_fill function, receiving and loading the input tensor and the mask tensor and outputting and storing the output tensor account for a large share of the time spent in the data computation, while determining the values of the elements in the output tensor accounts for only a small share, so that severe pressure is placed on the bandwidth of the board (e.g., a general-purpose computing graphics processing unit (GPGPU) or a plurality of parallel Graphics Processing Units (GPUs)) performing the data computation, resulting in poor data computation performance.
To at least partially address one or more of the above problems, as well as other potential problems, example embodiments of the present invention propose a scheme for mask tensor conversion. In this scheme, the original mask tensor is converted into a first intermediate mask tensor based on a predetermined conversion rule, wherein the data in the first intermediate mask tensor is bit-type data; the data of the first intermediate mask tensor is grouped in a first dimension to obtain a plurality of initial data blocks; and the data in the plurality of initial data blocks is rearranged to obtain a second intermediate mask tensor, wherein the second intermediate mask tensor comprises a plurality of rearranged data blocks and each rearranged data block comprises a portion of the data in each initial data block. In this way the size of the original mask tensor can be compressed, reducing the pressure on board bandwidth during data computation. Example embodiments of the present invention also provide a scheme for performing data computation based on the converted mask tensor, so that data stalls in the data loading stage can be avoided and the execution performance of data computation on the board is greatly improved.
Schemes for mask tensor conversion according to embodiments of the present invention will be described in detail below with reference to fig. 1 to 4. The method 200 may be performed by a computing device 800 as shown in fig. 8. It should be appreciated that method 200 may also include additional actions not shown and/or may omit actions shown, the scope of the invention being not limited in this respect.
Fig. 1 shows a schematic diagram of a system 100 for mask tensor conversion according to an embodiment of the invention. It should be appreciated that system 100 may also include additional modules not shown and/or the illustrated modules may be omitted, as the scope of the invention is not limited in this respect.
As shown in fig. 1, the system 100 includes: an original mask tensor receiving module 110, a converting module 120, a grouping module 130, a reordering module 140, a compressing module 150, and a result mask tensor output module 160.
Regarding the original mask tensor receiving module 110, it may be configured to receive the original mask tensor. According to an embodiment of the invention, the original mask tensor is typically a boolean tensor, i.e. the type of data in the original mask tensor is boolean.
With respect to the conversion module 120, it may be configured to convert the original mask tensor into a first intermediate mask tensor, wherein the data in the first intermediate mask tensor is bit-type data.
With respect to the grouping module 130, it may be configured to group the data in the first intermediate mask tensor. According to an embodiment of the invention, data in the first intermediate mask tensor is grouped in a first dimension to obtain a plurality of initial data blocks.
With respect to the rearrangement module 140, it may be configured to rearrange the data in the plurality of initial data blocks in the first intermediate mask tensor to obtain the second intermediate mask tensor. According to an embodiment of the invention, the second intermediate mask tensor comprises a plurality of rearranged data blocks, and each rearranged data block comprises a portion of the data in each initial data block.
With respect to compression module 150, it may be configured to compress the second intermediate mask tensor such that the data in the second intermediate mask tensor is converted into integer-type data or floating-point-type data to obtain a result mask tensor.
With respect to the result mask tensor output module 160, it may be configured to output the result mask tensor to facilitate subsequent loading of the result mask tensor into, for example, a masked_fill function to perform the corresponding data calculation.
Fig. 2 illustrates a flow chart of a method 200 for mask tensor conversion according to an embodiment of the invention.
At step 202, the original mask tensor is converted by the system 100 into a first intermediate mask tensor based on a predetermined conversion rule, wherein the data in the first intermediate mask tensor is bit-type data.
Regarding the predetermined conversion rule, it may be a rule for instructing to convert boolean data having different boolean values into specific bit-type data.
According to an embodiment of the present invention, Boolean data having a value of "true" and Boolean data having a value of "false" in the original mask tensor are converted into bit-type data having different values, respectively. For example, in some embodiments, Boolean data having a value of "true" may be converted to bit-type data having a value of 1, and Boolean data having a value of "false" may be converted to bit-type data having a value of 0. In still other embodiments, Boolean data with a value of "true" may be converted to bit-type data with a value of 0, and Boolean data with a value of "false" may be converted to bit-type data with a value of 1.
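As a hedged illustration, the first of these rules (true to 1, false to 0) could be expressed as follows; the function name and the use of a NumPy uint8 array to hold the bit values are assumptions of this sketch only:

    import numpy as np

    def to_bit_tensor(original_mask: np.ndarray) -> np.ndarray:
        # True becomes bit value 1, False becomes bit value 0; the inverse
        # rule would be 1 - original_mask.astype(np.uint8). Each element here
        # still occupies one byte; the later compression step packs 32 such
        # bit values into a single 32-bit word.
        return original_mask.astype(np.uint8)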
At step 204, the data of the first intermediate mask tensor is grouped by the system 100 in a first dimension to obtain a plurality of initial data blocks.
With respect to the first dimension, it may refer to a width dimension. According to an embodiment of the present invention, grouping the data of the first intermediate mask tensor in the first dimension may mean grouping a plurality of consecutive bit-type data into a group in the width dimension of the first intermediate mask tensor, thereby implementing the grouping of the data of the first intermediate mask tensor.
Regarding the grouping of the data of the first intermediate mask tensor, according to an embodiment of the invention, it may comprise: determining a data length for grouping; and grouping the data of the first intermediate mask tensor in the first dimension based on the determined data length, so that the plurality of initial data blocks obtained after grouping are of the same size.
As to the data length used for grouping, it may be related to the type of data in the result mask tensor to be obtained. In one example, if the type of data in the result mask tensor is, for example, single-precision floating point (fp32), then the data length for grouping is determined to be 32 bits. In yet another example, if the type of data in the result mask tensor is, for example, 64-bit integer (int64), then the data length for grouping is determined to be 64 bits.
In an example of the present invention, where the data length of the first intermediate mask tensor in the width dimension is 512 bits and the data length for grouping is determined to be 32 bits based on the type of data in the result mask tensor, each run of 32 consecutive bits of data in the first intermediate mask tensor may be grouped in the width dimension, resulting in 16 initial data blocks of the same size.
In some embodiments of the present invention, the data length of the first intermediate mask tensor in the width dimension may not be divisible by the data length used for grouping, in which case the data of the first intermediate mask tensor needs to be padded in the width dimension first, and the padded first intermediate mask tensor is then grouped. Specifically, according to an embodiment of the present invention, grouping the data of the first intermediate mask tensor in the first dimension to obtain a plurality of initial data blocks may include: determining whether the data length of the first intermediate mask tensor in the first dimension is divisible by the data length used for grouping; and in response to the data length of the first intermediate mask tensor in the first dimension not being divisible by the data length used for grouping, padding the data of the first intermediate mask tensor in the first dimension such that the padded data length in the first dimension is divisible by the data length used for grouping.
For example, in one example of the present invention, the data length of the first intermediate mask tensor in the first dimension is 508 bits, and if the data length for grouping is determined to be 32 bits based on the type of data in the result mask tensor, it is determined that the data length of the first intermediate mask tensor in the first dimension (i.e., 508 bits) is not divisible by the data length for grouping (i.e., 32 bits). In this case, the data of the first intermediate mask tensor may be padded in the first dimension, for example by appending 0s at the end, so that the data length in the first dimension becomes divisible by the data length for grouping (i.e., 32 bits). In the above example, the data length of the first intermediate mask tensor in the first dimension may thus be padded to 512 bits, and the grouping is then performed on the padded first intermediate mask tensor.
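A minimal sketch of this padding step, assuming zero-padding at the end of the width dimension as in the example above (the helper name is hypothetical):

    import numpy as np

    def pad_width_to_multiple(bits: np.ndarray, group_len: int = 32) -> np.ndarray:
        # bits has shape (H, W); pad W up to the next multiple of group_len with 0s.
        w = bits.shape[-1]
        remainder = w % group_len
        if remainder == 0:
            return bits  # already divisible, e.g. 512 % 32 == 0
        pad = group_len - remainder  # e.g. W = 508 -> pad 4 bits -> 512
        return np.pad(bits, ((0, 0), (0, pad)), constant_values=0)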
At step 206, the data in the plurality of initial data chunks is rearranged by the system 100 to obtain a second intermediate mask tensor.
With respect to the second intermediate mask tensor, it may include a plurality of rearranged data blocks, wherein each rearranged data block includes a portion of the data in each initial data block. Furthermore, according to an embodiment of the present invention, the number of rearranged data blocks in the second intermediate mask tensor may be the same as the number of initial data blocks in the first intermediate mask tensor.
According to the embodiment of the invention, the data in the initial data blocks is rearranged and moved to obtain the rearranged data blocks, so that the result mask tensor obtained from the second intermediate mask tensor formed by the rearranged data blocks is more easily loaded into the masked_fill function to perform data computation, enabling the masked_fill function to be executed with high performance on the board.
With respect to rearranging the data in the plurality of initial data blocks to obtain the second intermediate mask tensor, according to an embodiment of the invention, it may include: determining, for each initial data block, a first index value of the initial data block in a first dimension and a second index value in a second dimension; determining, for each data in the initial data block, a third index value of the data in the initial data block; and determining a position of the data in the second intermediate mask tensor based at least on the first index value and the second index value of the initial data block and the third index value of the data in the initial data block, so as to move the data to a corresponding position of a corresponding rearranged data block. Further, determining the position of the data in the second intermediate mask tensor may include: determining the rearranged data block corresponding to the data based on the third index value of the data in the initial data block and the second index value of the initial data block; and determining the position of the data in the determined rearranged data block based on the first index value and the second index value of the initial data block. The details will be described in connection with fig. 3a and 3b and are not repeated here.
According to an embodiment of the present invention, the method 200 may further include: at step 208, the second intermediate mask tensor is compressed by the system 100 such that the data in the second intermediate mask tensor is converted to integer-type data or floating-point-type data to obtain a result mask tensor.
As for the integer type data, it may be, for example, 16-bit integer type (int 16) data, 32-bit integer type (int 32) data, 64-bit integer type (int 64) data, or the like.
As for the floating point type data, it may be, for example, half-precision floating point type (fp 16) data, single-precision floating point type (fp 32) data, double-precision floating point type (fp 64) data, brain floating point type (bf 16) data, or the like.
Compressing the second intermediate mask tensor means compressing each rearranged data block in the second intermediate mask tensor so as to convert the rearranged data block into corresponding integer-type or floating-point-type data to obtain the result mask tensor. The details will be described in conjunction with fig. 4 and are not repeated here.
An example implementation of the rearrangement that converts the first intermediate mask tensor into the second intermediate mask tensor according to an embodiment of the present invention will be described in detail below in connection with fig. 3a and 3b. Fig. 3a shows a schematic diagram of a first intermediate mask tensor 310 according to an embodiment of the invention. Fig. 3b shows a schematic diagram of the principle of rearranging data in the first intermediate mask tensor 310 to obtain the second intermediate mask tensor 320 according to an embodiment of the present invention.
As shown in fig. 3a, the first intermediate mask tensor 310 has a data length of 512 bits in a first dimension (i.e., width dimension (W dimension)) and a size of 2 in a second dimension (i.e., height dimension (H dimension)). That is, the first intermediate mask tensor 310 shown in fig. 3a is a tensor of 512 bits×2 rows.
According to an embodiment of the invention, for each line of data in the first intermediate mask tensor 310, the data of the first intermediate mask tensor 310 is grouped in a first dimension with 32 bits as the data length for the grouping. As shown in fig. 3a, the first intermediate mask tensor 310 includes 32 initial data blocks in total, wherein each row of data in the first intermediate mask tensor 310 is divided into 16 initial data blocks of 32 bits in size.
Further, the position of each bit-type data in each initial data block may be represented by a third index value (bit_in_block_idx_original). According to an embodiment of the present invention, the index of the bit-type data in the initial data block may be counted from 0, i.e., the third index value corresponding to the nth bit-type data in the initial data block is bit_in_block_idx_original = n - 1, where n is a positive integer. For example, the third index value of the 1st bit-type data in the initial data block is 0, the third index value of the 2nd bit-type data is 1, and so on.
With respect to the initial data block, its position in the first intermediate mask tensor 310 may be indicated by a position index, where the position index may include: a first index value (block_idx_original) in a first dimension and a second index value (h_original) in a second dimension. The second index value may be used to indicate which row of the first intermediate mask tensor 310 the initial data chunk is in, while the first index value may be used to indicate which data chunk the initial data chunk is in.
Taking the initial data block 312 in the first intermediate mask tensor 310 as an example, if each index is counted from 0, as shown in fig. 3a, the second index value of the initial data block 312 in the second dimension is H_original = 0 and the first index value in the first dimension is block_idx_original = 0, so the initial data block 312 is the 1st data block of the first row in the first intermediate mask tensor 310. Accordingly, for the initial data block 314, its second index value in the second dimension is H_original = 1 and its first index value in the first dimension is block_idx_original = 15, indicating that the initial data block 314 is the 16th data block of the second row in the first intermediate mask tensor 310.
As shown in fig. 3b, the bit-type data in each initial data block in the first intermediate mask tensor 310 is rearranged into different rearranged data blocks in the second intermediate mask tensor 320. Specifically, the rearranged data block of the second intermediate mask tensor 320 into which a given bit-type data is moved, and its specific position within that rearranged data block, may be determined based on the third index value of the bit-type data in the initial data block and the first and second index values of the initial data block in which the bit-type data is located.
According to an embodiment of the present invention, the width index value (block_idx_transferred) and the height index value (H_transferred) of the rearranged data block, as well as the new third index value (bit_in_block_idx_transferred) of the bit-type data in the rearranged data block, may be calculated based on, for example, the following conversion formulas (1) to (3):
H_transferred = H_original // 2 (1)
block_idx_transferred = bit_in_block_idx_original (2)
bit_in_block_idx_transferred = 15 - block_idx_original + (H_original % 2) × 16 (3)
as can be seen from the above formula (1), the height index value of the rearranged data block can be determined based on the second index value of the initial data block in the second dimension (i.e., the height index value of the initial data block), where "//" is an integer division operator for obtaining the integer part of the division operation. For example, in the embodiment shown in fig. 3b, the second index value h_original of the initial data block in the second dimension may be divided by 2 to obtain the height index value h_transferred of the rearranged data block.
As can be seen from the above formula (2), the width index value of the rearranged data block can be determined based on the third index value of the bit-type data in the initial data block. For example, in the embodiment shown in fig. 3b, the width index value block_idx_transferred of the rearranged data block is equal to the third index value bit_in_block_idx_original of the bit-type data in the initial data block.
As can be seen from the above formula (3), the new third index value of the bit-type data in the rearranged data block can be determined based on the first index value of the initial data block in the first dimension and the second index value of the initial data block in the second dimension, where "%" is the modulo operator, which takes the remainder of a division. For example, in the embodiment shown in fig. 3b, the new third index value bit_in_block_idx_transferred is obtained by taking the remainder of the second index value H_original divided by 2, multiplying that remainder by 16, adding the product to 15, and subtracting the first index value block_idx_original of the initial data block in the first dimension.
For example, take the 1st bit-type data in the initial data block 312. As described above, the first index value of the initial data block 312 in the first dimension is block_idx_original = 0, the second index value in the second dimension is H_original = 0, and the third index value of the 1st bit-type data in the initial data block 312 is bit_in_block_idx_original = 0. From the conversion formulas above, this bit-type data is to be moved to the rearranged data block with height index value H_transferred = 0 // 2 = 0 and width index value block_idx_transferred = 0, and its new third index value in the rearranged data block is bit_in_block_idx_transferred = 15 - 0 + (0 % 2) × 16 = 15.
In yet another example, take the 32nd bit-type data in the initial data block 314. As described above, the first index value of the initial data block 314 in the first dimension is block_idx_original = 15, the second index value in the second dimension is H_original = 1, and the third index value of the 32nd bit-type data in the initial data block 314 is bit_in_block_idx_original = 31. From the conversion formulas above, this bit-type data is to be moved to the rearranged data block with height index value H_transferred = 1 // 2 = 0 and width index value block_idx_transferred = 31, and its new third index value in the rearranged data block is bit_in_block_idx_transferred = 15 - 15 + (1 % 2) × 16 = 16.
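Formulas (1) to (3) can be checked directly in code. The sketch below is an illustrative transcription of the formulas, not the patent's implementation; it reproduces the two worked examples above:

    def rearrange_index(block_idx_original, h_original, bit_in_block_idx_original):
        h_transferred = h_original // 2                    # formula (1)
        block_idx_transferred = bit_in_block_idx_original  # formula (2)
        bit_in_block_idx_transferred = (                   # formula (3)
            15 - block_idx_original + (h_original % 2) * 16
        )
        return h_transferred, block_idx_transferred, bit_in_block_idx_transferred

    # 1st bit of initial data block 312 (block_idx = 0, row 0, bit index 0):
    assert rearrange_index(0, 0, 0) == (0, 0, 15)
    # 32nd bit of initial data block 314 (block_idx = 15, row 1, bit index 31):
    assert rearrange_index(15, 1, 31) == (0, 31, 16)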
As can be seen from the above, the first intermediate mask tensor 310 can be converted into the second intermediate mask tensor 320 by the rearrangement operation of the present invention. As shown in fig. 3a and 3b, the rearranged second intermediate mask tensor 320 is compressed to 1/2 of the first intermediate mask tensor 310 in the second dimension (i.e., the height dimension); that is, the second intermediate mask tensor 320 is compressed to 1/2 of the original mask tensor in the second dimension.
Fig. 4 shows a schematic diagram of the principle of compressing the second intermediate mask tensor 320 to obtain the resulting mask tensor 330 according to an embodiment of the present invention.
As described above, according to embodiments of the present invention, each rearranged data block in the second intermediate mask tensor may be compressed to convert the rearranged data block into corresponding integer type data or floating point type data so as to obtain the result mask tensor.
As shown in fig. 4, the second intermediate mask tensor 320 includes 32 rearranged data blocks. Taking the first rearranged data block 322 in the second intermediate mask tensor 320 as an example, the first rearranged data block 322 is compressed so that it is converted into floating-point data (such as fp32) or integer data (such as int32), and the converted data is the first data element in the result mask tensor 330. Further, according to an embodiment of the present invention, the 32 rearranged data blocks in the second intermediate mask tensor 320 in fig. 4 may each be converted into corresponding floating-point or integer data, and the result mask tensor 330 may be formed from the 32 converted data elements. Taking floating-point data as an example, in some embodiments of the present invention, the 32 rearranged data blocks of the second intermediate mask tensor 320 are respectively converted into corresponding floating-point data, and the result mask tensor 330 is formed from the 32 converted floating-point data elements.
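A hedged sketch of this packing step for one rearranged data block: 32 bit values are folded into a single 32-bit word, which can then be reinterpreted as int32 or fp32. The most-significant-bit-first ordering is an assumption chosen for illustration only:

    import numpy as np

    def pack_block(bits32):
        # bits32: sequence of 32 bit values (0 or 1); fold MSB-first into one word.
        word = 0
        for b in bits32:
            word = (word << 1) | int(b)
        return np.uint32(word)

    bits = [1] + [0] * 31                  # leading bit set
    word = pack_block(bits)                # 0x80000000 as an unsigned word
    as_int32 = np.array([word]).view(np.int32)[0]                     # reinterpret as int32
    as_fp32 = np.array([word], dtype=np.uint32).view(np.float32)[0]   # or as fp32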
As can be seen from the above, with the above compression operation of the present invention, each 32-bit rearranged data block of the second intermediate mask tensor 320 becomes a single data element, so that in the first dimension (i.e., the width dimension) the result mask tensor 330 is compressed to 1/16 of the width of the original mask tensor. That is, the result mask tensor 330 is compressed overall to 1/32 of the original mask tensor.
According to the above scheme for mask tensor conversion, the Boolean data in the original mask tensor is converted into bit-type data, and the converted bit-type data is rearranged and compressed to obtain the result mask tensor. This reduces the pressure on board bandwidth when loading the mask tensor, and performing data computation based on the obtained result mask tensor fully exploits the performance benefit of the relieved bandwidth pressure, greatly improving the execution performance of data computation on the board.
In addition, the result tensor obtained by rearrangement and compression based on the conversion formulas can be operated on by the masked_fill function directly after being loaded into hardware, thereby improving operation efficiency.
The scheme of performing data computation based on the result mask tensor obtained by the above mask tensor conversion will be described in detail below with reference to fig. 5 to 7.
Fig. 5 shows a schematic diagram of a system 500 for data computation according to an embodiment of the invention. It should be appreciated that system 500 may also include additional elements not shown and/or may omit elements shown, the scope of the invention being not limited in this respect.
As shown in fig. 5, the system 500 includes: a parameter receiving unit 510, an original mask tensor converting unit 520, a calculating unit 530, and an output unit 540.
Regarding the parameter receiving unit 510, it may be configured to receive the input parameters to be used for data computation. According to an embodiment of the invention, the parameter receiving unit 510 may be configured to receive the input tensor, the original mask tensor, and the fill value.
Regarding the input tensor, it may be a tensor to be used for large-model training. According to an embodiment of the invention, the input tensor may be a floating-point or integer tensor as commonly used for large-model training, i.e., the type of data in the input tensor may be floating point (such as fp16, fp32, bf16) or integer (such as int16, int32).
As for the original mask tensor, as indicated above, it may be a boolean tensor and is used to indicate the location of the element to be replaced in the input tensor. According to an embodiment of the invention, the original mask tensor is the same shape and size as the input tensor.
With respect to the fill value, it may be used to determine the values of some of the elements in the output tensor.
Regarding the original mask tensor conversion unit 520, it may be configured to convert the original mask tensor to obtain a resulting mask tensor according to the method for mask tensor conversion as described above.
Regarding the result mask tensor, it may be a floating-point or integer tensor, i.e., the type of data in the result mask tensor may be floating point (such as fp16, fp32, bf16) or integer (such as int16, int32).
Regarding the calculation unit 530, it may be configured to perform, for example, a masked_fill function operation. According to an embodiment of the present invention, the calculation unit 530 may load the input tensor, the result mask tensor and the padding value into the masked_fill function to perform an operation so as to determine the output tensor.
As for the output unit 540, it may be configured to output the output tensor calculated by the calculation unit 530.
Fig. 6 shows a flow chart of a method 600 for data computation according to an embodiment of the invention. The method 600 may be performed by a computing device 800 as shown in fig. 8. It should be appreciated that method 600 may also include additional actions not shown and/or may omit actions shown, the scope of the invention being not limited in this respect.
In step 602, input parameters are received by the parameter receiving unit 510 of the system 500, wherein the input parameters include: an input tensor, an original mask tensor, and a fill value.
In embodiments of the present invention, the shape of a tensor may be expressed in the form (N, H, W) commonly used in the attention mechanism, where N represents the batch size (batch_size), H represents the height, and W represents the width. On this basis, the input tensor may have a first tensor shape, e.g., denoted (N_input, H_input, W_input). The original mask tensor may have a second tensor shape, e.g., denoted (N_original, H_original, W_original). According to an embodiment of the invention, the shape of the input tensor is the same as the shape of the original mask tensor, i.e., the first tensor shape is the same as the second tensor shape. That is, N_input = N_original, H_input = H_original, and W_input = W_original.
In step 604, the original mask tensor is converted by the original mask tensor conversion unit 520 of the system 500 to obtain a result mask tensor.
Regarding the conversion of the original mask tensor to obtain the resulting mask tensor, reference may be made to the schemes for mask tensor conversion described above in connection with fig. 1 to 4, which are not described here again.
Further, the result mask tensor may have a third tensor shape, e.g., denoted (N_transferred, H_transferred, W_transferred). As described above, the result mask tensor obtained according to the embodiment of the present invention may be compressed to different degrees in the height dimension (i.e., the H dimension) and the width dimension (i.e., the W dimension) relative to the original mask tensor, so that the pressure on board bandwidth is reduced when loading the result mask tensor.
For example, in one embodiment of the present invention, assuming that the original mask tensor has a second tensor shape denoted (1, 2, 512), i.e., N_original = 1, H_original = 2, W_original = 512, and assuming that the result mask tensor is an fp32-type tensor, the third tensor shape of the result mask tensor has the following relationship with the second tensor shape of the original mask tensor (a short sketch computing these values follows the formulas):
N_transferred = N_original = 1;
H_transferred = (H_original + 1) // 2 = 1;
W_transferred = (W_original + 511) // 512 × 32 = 32.
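These shape relations are easy to compute programmatically; the following helper is a sketch under the assumption of a 32-bit result data type (the function name is hypothetical):

    def transferred_shape(n_original, h_original, w_original):
        # Constants assume a 32-bit result type: the height is halved (rounded
        # up), and every 512 boolean columns map to 32 packed 32-bit elements.
        n_t = n_original
        h_t = (h_original + 1) // 2
        w_t = (w_original + 511) // 512 * 32
        return n_t, h_t, w_t

    assert transferred_shape(1, 2, 512) == (1, 1, 32)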
from the above, the resulting mask tensor is compressed 2 times in the height dimension and 16 times in the width dimension compared to the original mask tensor.
In still other embodiments of the present invention, if the data length of the original mask tensor in the width dimension exceeds 512 bits, the original mask tensor may be split, for example at a granularity of 512 bits, into a plurality of sub-original mask tensors each with a data length of 512 bits in the width dimension. Each sub-original mask tensor is subjected to the rearrangement and compression operations described in fig. 3a, 3b and 4 to obtain a corresponding sub-result mask tensor, and the obtained sub-result mask tensors are then sequentially stitched together to obtain the final result mask tensor, as illustrated in fig. 7.
Fig. 7 shows an exemplary schematic diagram of converting an original mask tensor into a resulting mask tensor according to an embodiment of the invention.
As shown in fig. 7, the data length of the original mask tensor 710 in the width dimension is 1024, and the data length in the height dimension is 1024; that is, for the original mask tensor 710, H_original = 1024 and W_original = 1024. Here, the data in the original mask tensor 710 is Boolean data.
As shown in fig. 7, the original mask tensor 710 is split into 4 sub-original mask tensors, namely sub-original mask tensor 720a, sub-original mask tensor 720b, sub-original mask tensor 720c, and sub-original mask tensor 720d (collectively, sub-original mask tensors 720). Each sub-original mask tensor has a data length of 512 in the width dimension and a data length of 512 in the height dimension. That is, for the sub-original mask tensors 720, H_sub-original = 512 and W_sub-original = 512.
As shown in fig. 7, each sub-original mask tensor 720 is converted into a corresponding sub-result mask tensor 730 after the rearrangement and compression operations, wherein the data length of each sub-result mask tensor 730 in the width dimension is 32 and the data length in the height dimension is 256. That is, for the sub-result mask tensors 730, H_sub-transferred = 256 and W_sub-transferred = 32. It should be noted that the data in the sub-result mask tensors 730 may be fp32-type data or int32-type data.
As shown in fig. 7, the sub-result mask tensors 730 are sequentially stitched together to obtain the final result mask tensor 740, wherein the data length of the final result mask tensor 740 in the width dimension is 64 (i.e., 32 × 2) and the data length in the height dimension is 512 (i.e., 256 × 2). That is, for the final result mask tensor 740, H_transferred = 512 and W_transferred = 64. The data type of the data in the final result mask tensor 740 is the same as that in the sub-result mask tensors 730; for example, if the data in the sub-result mask tensors 730 is fp32 data, the data in the final result mask tensor 740 is also fp32 data.
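A sketch of this split-convert-stitch flow, assuming a convert_tile function that performs the rearrangement and compression of figs. 3a, 3b and 4 on one 512 × 512 Boolean tile; both function names are hypothetical:

    import numpy as np

    def convert_large_mask(mask: np.ndarray, tile: int = 512) -> np.ndarray:
        # mask: Boolean array of shape (H, W); H and W are multiples of 512 here.
        h, w = mask.shape
        rows = []
        for i in range(0, h, tile):
            row_parts = []
            for j in range(0, w, tile):
                sub = mask[i:i + tile, j:j + tile]   # one sub-original mask tensor
                row_parts.append(convert_tile(sub))  # -> shape (256, 32) per fig. 7
            rows.append(np.concatenate(row_parts, axis=1))
        return np.concatenate(rows, axis=0)          # e.g. (1024, 1024) -> (512, 64)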
As can be seen from the above, according to the embodiments of the present invention, when the original mask tensor has a large tensor shape, the original mask tensor may first be split, and the mask tensor conversion scheme of the present invention is then applied to each split sub-original mask tensor, which likewise achieves the technical effect of compressing the size of the original mask tensor and reducing the pressure on board bandwidth. In addition, the final result mask tensor 740 obtained by sequential stitching may be operated on directly after being loaded into hardware, for example loaded into the computing unit 530 of the system 500 to perform data computation based on the masked_fill function, thereby improving the efficiency of data computation.
In step 606, an output tensor is determined by the computing unit 530 of the system 500 based on the input tensor, the result mask tensor, and the fill value.
With respect to determining the output tensor, it may refer to performing data computation based on the masked_fill function. For example, the input tensor, the result mask tensor, and the fill value may be loaded into the masked_fill function, and the output tensor is determined by performing the data computation. Specifically, after the result mask tensor is loaded into hardware (e.g., the computing unit 530 of the system 500), the masked_fill operation may be performed on the input tensor data by, for example, select/replace operations driven by the bit data. In this case, each time a selection/replacement operation is completed for the input tensor, the bit data at the current operation position is discarded by a shift operation and the subsequently unused bit data is shifted to the current operation position, so that the selection/replacement operation for the subsequent bit data can be performed, thereby determining the output tensor. That is, based on the input tensor, the result mask tensor, and the fill value, the output tensor may be determined by executing a combination of the sel instruction and the shl instruction, and the determined output tensor is stored.
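The select-and-shift loop can be pictured with the following hedged Python sketch; the sel/shl instruction pair is modeled with a conditional expression and a left shift, which is only an approximation of the board-level instructions, and the MSB-first bit ordering is an assumption of this sketch:

    def masked_fill_with_packed_mask(inputs, mask_words, fill_value, word_bits=32):
        # inputs: flat list of elements; mask_words: packed 32-bit mask words.
        out = []
        for w_idx, word in enumerate(mask_words):
            for k in range(word_bits):
                # "sel": choose fill_value if the current top bit is set.
                bit = (word >> (word_bits - 1)) & 1
                out.append(fill_value if bit else inputs[w_idx * word_bits + k])
                # "shl": discard the consumed bit and move the next one up.
                word = (word << 1) & ((1 << word_bits) - 1)
        return out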
At step 608, the output tensor is output by the output unit 540 of the system 500.
As can be seen from the above, according to the scheme for data computation of the embodiments of the present invention, the original mask tensor is rearranged, compressed, and converted into the result mask tensor before the data computation is performed, and the result mask tensor is then loaded into the computing unit for data computation. Data stalls in the data loading stage can thus be avoided, greatly improving the performance of data computation on the board. Moreover, the converted result mask tensor can be used for data computation directly after being loaded into the computing unit, improving the efficiency of the data computation.
FIG. 8 schematically illustrates a block diagram of a computing device 800 suitable for implementing embodiments of the present invention. Device 800 may be a device for implementing the method 200 shown in fig. 2 and the method 600 shown in fig. 6. As shown in fig. 8, device 800 includes a processing unit 801 (including but not limited to a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a General Purpose Graphics Processing Unit (GPGPU), and any combination of the foregoing), which may perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 802 or loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The processing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The processing unit 801 performs the various methods and processes described above, such as method 200 or method 600. For example, in some embodiments, method 200 or method 600 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via the ROM 802 and/or the communication unit 809. When a computer program is loaded into the RAM 803 and executed by the processing unit 801, one or more of the operations of method 200 or method 600 described above may be performed. Alternatively, in other embodiments, the processing unit 801 may be configured to perform one or more actions of method 200 and/or method 600 by any other suitable means (e.g., by way of firmware).
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present invention may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.
These computer readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for mask tensor conversion, comprising:
converting the original mask tensor into a first intermediate mask tensor based on a predetermined conversion rule, wherein data in the first intermediate mask tensor is bit-type data;
grouping the data of the first intermediate mask tensor in a first dimension to obtain a plurality of initial data blocks; and
rearranging data in the plurality of initial data blocks to obtain a second intermediate mask tensor, wherein the second intermediate mask tensor comprises a plurality of rearranged data blocks, wherein each rearranged data block comprises a portion of the data in each initial data block.
2. The method of claim 1, wherein the data in the original mask tensor is Boolean data, and wherein converting the original mask tensor into the first intermediate mask tensor based on the predetermined conversion rule comprises:
converting Boolean data having a value of true in the original mask tensor into bit-type data having a first value; and
converting Boolean data having a value of false in the original mask tensor into bit-type data having a second value.
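By way of illustration only, a minimal NumPy sketch of this conversion, assuming the first value is 1 and the second value is 0 (the claim fixes neither value):

```python
import numpy as np

def bool_to_bit(original_mask: np.ndarray) -> np.ndarray:
    # Claim 2 under one concrete choice of values:
    # true -> first value (assumed 1), false -> second value (assumed 0).
    return np.where(original_mask, np.uint8(1), np.uint8(0))
```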
3. The method of claim 1, wherein grouping the data of the first intermediate mask tensor in the first dimension to obtain the plurality of initial data blocks comprises:
determining a data length for grouping; and
grouping the data of the first intermediate mask tensor in the first dimension based on the determined data length for grouping, such that the plurality of initial data blocks obtained after grouping have the same size.
4. The method of claim 3, wherein the data length for grouping is related to the type of data in the result mask tensor to be obtained.
5. The method of claim 3, wherein grouping the data of the first intermediate mask tensor in the first dimension to obtain the plurality of initial data blocks further comprises:
in response to the data length of the first intermediate mask tensor in the first dimension not being divisible by the data length for grouping, padding the data of the first intermediate mask tensor in the first dimension such that the data length of the padded first intermediate mask tensor in the first dimension is divisible by the data length for grouping.
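One possible realization of the grouping of claims 3-5, sketched in NumPy under two assumptions the claims do not fix: the pad value is 0, and the group length is passed in by the caller (per claim 4 it would be derived from the result data type, e.g. 8 for a uint8 result):

```python
import numpy as np

def group_first_dim(bit_mask: np.ndarray, group_len: int) -> np.ndarray:
    # Claim 5: if the first dimension does not divide evenly by the
    # grouping length, pad it (pad value 0 is an assumption).
    pad = (-bit_mask.shape[0]) % group_len
    if pad:
        widths = [(0, pad)] + [(0, 0)] * (bit_mask.ndim - 1)
        bit_mask = np.pad(bit_mask, widths)
    # Claims 3-4: split the first dimension into equal-size initial blocks.
    num_blocks = bit_mask.shape[0] // group_len
    return bit_mask.reshape(num_blocks, group_len, *bit_mask.shape[1:])
```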
6. The method of claim 1, wherein rearranging the data in the plurality of initial data blocks to obtain the second intermediate mask tensor comprises:
determining, for each initial data block, a first index value of the initial data block in the first dimension and a second index value of the initial data block in a second dimension;
determining, for each data in the initial data block, a third index value of the data in the initial data block; and
determining a position of the data in the second intermediate mask tensor based at least on the first index value and the second index value of the initial data block and the third index value of the data in the initial data block, so as to move the data to a corresponding position of a corresponding rearranged data block.
7. The method of claim 6, wherein determining the position of the data in the second intermediate mask tensor comprises:
determining the rearranged data block corresponding to the data based on the third index value of the data in the initial data block and the second index value of the initial data block; and
determining the position of the data in the determined rearranged data block based on the first index value and the second index value of the initial data block.
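Claims 6-7 leave the exact index mapping open; one concrete reading, in which the third index value selects the rearranged block and the initial block's own indices fix the position inside it, reduces to swapping the block axis and the in-block axis:

```python
import numpy as np

def rearrange_blocks(blocks: np.ndarray) -> np.ndarray:
    # blocks has shape (num_blocks, group_len, ...). Swapping the first two
    # axes yields group_len rearranged blocks, each holding exactly one
    # element from every initial block, as claim 1 requires. This is one
    # concrete reading, not the only mapping the claims admit.
    return np.ascontiguousarray(blocks.swapaxes(0, 1))
```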
8. The method of claim 1, further comprising:
compressing the second intermediate mask tensor such that the data in the second intermediate mask tensor is converted into integer-type data or floating-point-type data, so as to obtain a result mask tensor.
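A sketch of the compression of claim 8 for the integer-type case, assuming uint8 as the target type; np.packbits stores eight bit values per output element, which is what shrinks the mask to one eighth of its bit-wise size:

```python
import numpy as np

def compress_to_uint8(rearranged: np.ndarray, axis: int = 0) -> np.ndarray:
    # Fold every 8 bit values along `axis` into one uint8 word. Both the
    # uint8 target type and the packing axis are assumptions of this sketch;
    # the claim also admits floating-point-type results.
    return np.packbits(rearranged, axis=axis)
```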
9. A method for data computation, comprising:
receiving input parameters, the input parameters comprising an input tensor, an original mask tensor, and a fill value;
converting the original mask tensor according to the method of any one of claims 1-8 to obtain a result mask tensor; and
determining an output tensor based on the input tensor, the result mask tensor, and the fill value.
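Putting the illustrative helpers above together, an end-to-end sketch of the data computation of claim 9. The last line shows only the reference semantics via PyTorch's masked_fill; on the board contemplated by the patent, the compressed result mask tensor, rather than the full Boolean mask, would be transferred and unpacked before this step. All names here are editorial, not from the patent:

```python
import torch

x = torch.randn(16, 8)            # input tensor
mask = torch.rand(16, 8) > 0.5    # original mask tensor (Boolean)
fill_value = float("-inf")        # fill value, as used in attention masking

bits = bool_to_bit(mask.numpy())                           # claims 1-2
blocks = group_first_dim(bits, group_len=8)                # claims 3-5
result_mask = compress_to_uint8(rearrange_blocks(blocks))  # claims 6-8

# Reference semantics of the output tensor (claim 9).
out = x.masked_fill(mask, fill_value)
```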
10. A computing device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of claims 1-9.
11. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
12. A computer program product tangibly stored on a non-transitory computer readable medium and comprising machine-executable instructions which, when executed, cause a machine to perform the steps of the method of any one of claims 1-9.
CN202410101362.3A 2024-01-24 2024-01-24 Method, computing device, and storage medium for mask tensor conversion Active CN117632085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410101362.3A CN117632085B (en) 2024-01-24 2024-01-24 Method, computing device, and storage medium for mask tensor conversion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410101362.3A CN117632085B (en) 2024-01-24 2024-01-24 Method, computing device, and storage medium for mask tensor conversion

Publications (2)

Publication Number Publication Date
CN117632085A (en)
CN117632085B (en) 2024-04-19

Family

ID=90027225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410101362.3A Active CN117632085B (en) 2024-01-24 2024-01-24 Method, computing device, and storage medium for mask tensor conversion

Country Status (1)

Country Link
CN (1) CN117632085B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112437930A (en) * 2018-07-12 2021-03-02 华为技术有限公司 Generating a compressed representation of a neural network with skilled inference speed and power consumption
US20210125070A1 (en) * 2018-07-12 2021-04-29 Futurewei Technologies, Inc. Generating a compressed representation of a neural network with proficient inference speed and power consumption
US20230100930A1 (en) * 2021-09-30 2023-03-30 Amazon Technologies, Inc. Mixing sparsity compression
CN117371537A (en) * 2023-10-10 2024-01-09 Oppo广东移动通信有限公司 Tensor processing method, tensor processing device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN117632085B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN105264779B (en) Data compression and decompression using SIMD instruction
KR20160132943A (en) Solving digital logic constraint problems via adiabatic quantum computation
TW201915835A (en) Apparatus and method for accelerating multiplication with none-zero packets in artificial neuron
Flegar et al. Adaptive precision block-Jacobi for high performance preconditioning in the Ginkgo linear algebra software
KR20200094364A (en) Method and system for improving compression ratio by difference between blocks of image file
CN111381968B (en) Convolution operation optimization method and system for efficiently running deep learning task
CN110943744A (en) Data compression, decompression and processing method and device based on data compression and decompression
US20090322573A1 (en) Cabac type encoding device and method
CN117435855B (en) Method for performing convolution operation, electronic device, and storage medium
CN117632085B (en) Method, computing device, and storage medium for mask tensor conversion
US8732227B2 (en) Method and processor unit for implementing a characteristic-2-multiplication
EP4310700A1 (en) Matrix multiplier, matrix computing method, and related device
Shahbahrami Algorithms and architectures for 2D discrete wavelet transform
JP7106587B2 (en) Method and system for improving compression rate using pixel conversion of image file
CN116318660B (en) Message expansion and compression method and related device
CN113497627A (en) Data compression and decompression method, device and system
CN112511170B (en) Parallel realization method for polynomial compression in lattice password
CN110990776B (en) Coding distributed computing method, device, computer equipment and storage medium
KR100913467B1 (en) System And Method For Generating Parallel Cyclic Redundancy Codes
US20220036190A1 (en) Neural network compression device
CN115809707A (en) Quantum comparison operation method and device, electronic device and basic arithmetic assembly
Sierra et al. High-Performance Decoding of Variable-Length Memory Data Packets for FPGA Stream Processing
CN103975592B (en) The method and device of inverse quantization conversion coefficient
CN113592966A (en) Image processing method and device, electronic equipment and storage medium
KR101345127B1 (en) Butterfly processing method for processing complexity reduction by process skip and hevc system using the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant