CN115759294B - Data processing method, device, electronic equipment and storage medium


Info

Publication number
CN115759294B
Authority
CN
China
Prior art keywords
einstein
operator
operand
tensor
target
Prior art date
Legal status
Active
Application number
CN202211488824.9A
Other languages
Chinese (zh)
Other versions
CN115759294A (en)
Inventor
熊昆
张留杰
刘红雨
蓝翔
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211488824.9A
Publication of CN115759294A
Application granted
Publication of CN115759294B
Status: Active


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The disclosure provides a data processing method, a data processing device, electronic equipment and a storage medium, relates to the technical field of artificial intelligence, and particularly to the technical fields of machine learning, natural language processing, computer vision, protein structure prediction and the like. The specific implementation scheme is as follows: acquiring a first tensor required by a first Einstein operator; calling a kernel to transpose the first tensor to obtain a first tensor transpose result, and storing the first tensor transpose result in a target storage medium; generating the subscript of the first tensor of a second Einstein operator according to a preset marking rule in the planning and scheduling stage; reading the first tensor transpose result from the target storage medium in the case that the first tensor transpose result needs to be multiplexed; and performing the operation of the second Einstein operator based on the read first tensor transpose result and the subscript of the first tensor. According to the embodiments of the disclosure, the first tensor transpose result of the first Einstein operator is multiplexed, so that repeated kernel launches and calls are avoided, resource consumption is reduced, and operation efficiency is improved.

Description

Data processing method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the technical fields of machine learning, natural language processing, computer vision, and protein structure prediction.
Background
Einsum (Einstein summation convention), also known as Einstein notation or the Einstein operator, can conveniently represent various linear operations.
In deep learning tasks, linear operations are widely used, for example in Linear layers, Matmul (matrix multiplication) layers and the like. The Einstein operator is therefore very important and can greatly accelerate model design and implementation for users.
In the related art, executing an Einstein operator requires calling kernels that support the corresponding operations, but launching and calling a kernel consumes considerable computing resources and time.
Disclosure of Invention
The disclosure provides a data processing method, a data processing device, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a data processing method including:
acquiring a first tensor required by a first einstein operator;
calling the kernel to perform transposition operation on the first tensor to obtain a transposition result of the first tensor;
storing the first tensor transpose result into a target storage medium;
generating a subscript of the first tensor of the second einstein operator according to a preset marking rule in the stage of planning and scheduling the first tensor of the second einstein operator; the preset marking rule meets the requirement of multiplexing the first tensor transposition result;
Reading the first tensor transpose result from the target storage medium in the event that the second einstein operator needs to multiplex the first tensor transpose result;
and executing the operation process of the second Einstein operator based on the read transposed result of the first tensor and the subscript of the first tensor of the second Einstein operator.
According to another aspect of the present disclosure, there is provided a data processing apparatus including:
a first acquisition module for acquiring a first tensor required by a first einstein operator;
the calling module is used for calling the kernel to perform transposition operation on the first tensor to obtain a transposition result of the first tensor;
a first storage module for storing the first tensor transpose result into a target storage medium;
the generation module is used for generating subscripts of the first tensor of the second Einstein operator according to a preset marking rule in the stage of planning and scheduling the first tensor of the second Einstein operator; the preset marking rule meets the requirement of multiplexing the first tensor transposition result;
the reading module is used for reading the first tensor transposition result from the target storage medium under the condition that the second einstein operator needs to multiplex the first tensor transposition result;
And the execution module is used for executing the operation process of the second Einstein operator based on the read transposed result of the first tensor and the subscript of the first tensor of the second Einstein operator.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
In the embodiment of the disclosure, the first tensor transpose result of the first Einstein operator is multiplexed in the calculation process of the second Einstein operator, so that repeated kernel launches and calls are avoided, resource consumption is saved, and operation efficiency is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow diagram of transposing a resulting gradient into a target gradient in accordance with another embodiment of the present disclosure;
FIG. 3a is a schematic diagram of one example of a data processing method according to another embodiment of the present disclosure;
FIG. 3b is a schematic diagram of another example of a data processing method according to another embodiment of the present disclosure;
FIG. 4 is a schematic illustration of an application of a data processing method in the field of natural language processing according to another embodiment of the present disclosure;
FIG. 5 is a schematic illustration of an application of a data processing method in the field of protein structure prediction according to another embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a data processing apparatus according to another embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing a data processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terms "first," "second," and the like in embodiments of the present disclosure are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such as a series of steps or elements. The method, system, article, or apparatus is not necessarily limited to those explicitly listed but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
The Einstein operator can conveniently represent various linear operations. For example, ij->ji can represent a transpose, and ij,jk->ik can represent a matrix multiplication; outer products, inner products and the like can also be represented. Expressing all linear operations with a unified Einstein operator greatly reduces the user's memorization burden.
In a deep learning framework, the two-operand Einstein operator mainly accepts 3 parameters: 2 input operands (e.g., tensor A and tensor B) and a string calculation expression, and finally outputs a tensor C. For example, in the expression c=paddle.einsum(a, b, "ij,jk->ik"), a and b are each an operand, "ij,jk->ik" is the calculation expression, c is the output result of the Einstein operator, and paddle.einsum can be understood as the method that executes the Einstein operator.
In general, the forward implementation of almost all Einstein operators can be logically divided into two parts: the Planner (planning and scheduling stage) and the Executor (execution stage). The Planner part determines, from the expression, how to combine related operations such as Transpose, Matmul, Sum, etc. A large Einstein operator is thereby decomposed into a great number of small calculations, which are executed in a certain order to obtain the final result.
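As an illustration of the Planner/Executor split (a minimal sketch only: numpy.einsum and numpy kernels are used here in place of the framework's EinsumKernel, the shapes are arbitrary, and numpy.einsum takes the equation string first, unlike the paddle.einsum call order written above), the forward expression "ibnd,jbnd->bnij" can be planned as two Transposes followed by one batched Matmul:

import numpy as np

# Planner: for "ibnd,jbnd->bnij", batch = bn, contraction = d, free = i (from A) and j (from B).
# Executor: transpose both operands so the batch dimensions lead, then run one batched matmul.
i, b, n, d, j = 2, 3, 4, 5, 6
A = np.random.rand(i, b, n, d)               # operand A, subscript ibnd
B = np.random.rand(j, b, n, d)               # operand B, subscript jbnd

TA = np.transpose(A, (1, 2, 0, 3))           # ibnd -> bnid
TB = np.transpose(B, (1, 2, 3, 0))           # jbnd -> bndj
O = np.matmul(TA, TB)                        # (b,n,i,d) @ (b,n,d,j) -> bnij

assert np.allclose(O, np.einsum("ibnd,jbnd->bnij", A, B))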
There are two currently dominant einstein operator implementations. One is Trace mode, and the other is combination mode.
The combination approach typically implements a highly efficient C++-side Einstein operator forward function called EinsumKernel, which is then called when the gradient needs to be calculated. The EinsumGradKernel function completes the derivation of the 2 input operands (e.g., tensor A and tensor B) by calling EinsumKernel 2 times. Implementing the Einstein operator in the combination mode achieves good modularity, allows the reverse pass to share the forward logic, and is simple to implement and convenient for subsequent optimization.
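The combination idea can be illustrated as follows (a sketch, not the disclosure's EinsumGradKernel: numpy.einsum stands in for EinsumKernel, and the gradient equations below follow ordinary matrix calculus). For O = einsum("ij,jk->ik", A, B), the gradient of each input operand is itself an einsum of the output gradient dO and the other operand, so the backward pass can indeed be completed by two forward calls:

import numpy as np

A = np.random.rand(3, 4)                 # subscript ij
B = np.random.rand(4, 5)                 # subscript jk
O = np.einsum("ij,jk->ik", A, B)         # forward call

dO = np.ones_like(O)                     # upstream gradient of the output
dA = np.einsum("ik,jk->ij", dO, B)       # first forward call reused for the gradient of A
dB = np.einsum("ij,ik->jk", A, dO)       # second forward call reused for the gradient of B

# For this expression the gradients reduce to ordinary matrix products:
assert np.allclose(dA, dO @ B.T) and np.allclose(dB, A.T @ dO)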
The Trace mode is conceptually more straightforward than the combination mode and is implemented as follows: the sequence of operators participating in the computation is recorded during the forward calculation, and the corresponding reverse operators are called once in the reverse of the forward order. For example, if the Transpose operator -> Matmul operator are called in the forward direction, then the Matmul gradient operator -> Transpose gradient operator are called in the reverse direction.
Relative to the combination approach, the Trace approach has the advantage of a fast reverse pass, but it has the disadvantage of being complex to implement and highly dependent on the underlying components provided by the framework. If the framework does not support reverse Ops (operators) or does not provide a basic Trace component, the implementation cost will be quite enormous, because the mechanism of tracing Ops to their reverse Ops is almost a core mechanism of a deep learning framework.
In view of this, the embodiments of the present disclosure improve the combination mode so as to increase the speed of executing the Einstein operator in the combination mode and reduce the consumption of kernel resources. To achieve this objective, the embodiments of the present disclosure propose the technical idea that the combination mode can multiplex forward-propagation results, so as to accelerate the operation of Einstein operators in the combination mode by reducing the number of forward kernel calls and thereby save computational resources.
Based on the technical concept, the present disclosure proposes a data processing method, which can be applied to any scene where an einstein operator needs to be executed, such as the fields of natural language processing, machine vision, protein structure prediction, and the like.
As shown in fig. 1, a flowchart of a data processing method in an embodiment of the disclosure includes:
s101, acquiring a first tensor required by a first Einstein operator.
Wherein the first tensor is an operand of the first Einstein operator, such as the A tensor or the B tensor in the previous examples.
S102, calling the kernel to perform transposition operation on the first tensor to obtain a transposition result of the first tensor.
S103, storing the first tensor transpose result in the target storage medium.
The target storage medium may be, for example, a cache.
S104, generating subscripts of the first tensor of the second Einstein operator according to a preset marking rule in the stage of planning and scheduling the first tensor of the second Einstein operator. The preset marking rule meets the requirement of multiplexing the first tensor transposition result.
As set forth above, any Einstein operator includes a Planner stage and an Executor stage. To execute an Einstein operator, the subscripts are first marked in the Planner stage, and the operation is then performed in the Executor stage based on the marking result. Table 1 shows an example of marking the subscripts; it lists the label sets and the subscripts under each set. Suppose there are 2 operands, A and B, for example in the expression mik,mjk->mij. The Planner stage first classifies and marks the subscripts; there are 4 common categories, which may include:
batch (hereinafter also referred to as first set of marks);
free (hereinafter also referred to as the second set of markers or the third set of markers);
contraction and reduction.
In the Einstein operator, A and B are the operands and O is the output result. Accordingly, batch denotes a subscript common to A, B and O; free denotes a subscript present only in AO or in BO; contraction denotes a subscript present in both A and B but not in O; reduction denotes a subscript present only in A or only in B.
According to this classification rule, in the expression mik,mjk->mij, m belongs to the batch category, i and j belong to the free category, and k belongs to the contraction category. For reduction, take einsum("i,j->j", A, B) as an example: i appears in the input parameters but not in the output result after the operation, which indicates a reduction over i.
Set        ABO     AO      BO      AB            A           B
Category   batch   freeA   freeB   contraction   reduction   reduction
TABLE 1
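A minimal sketch of the Planner-stage classification of Table 1 (an illustration only: the helper name and the assumption of single-letter subscripts are not from the disclosure):

def classify_labels(a_labels: str, b_labels: str, out_labels: str) -> dict:
    """Classify subscripts into the six sets of Table 1 for a two-operand einsum."""
    a, b, o = set(a_labels), set(b_labels), set(out_labels)
    return {
        "ABO (batch)":       [c for c in a_labels if c in b and c in o],
        "AO (freeA)":        [c for c in a_labels if c not in b and c in o],
        "BO (freeB)":        [c for c in b_labels if c not in a and c in o],
        "AB (contraction)":  [c for c in a_labels if c in b and c not in o],
        "A (reduction)":     [c for c in a_labels if c not in b and c not in o],
        "B (reduction)":     [c for c in b_labels if c not in a and c not in o],
    }

# For the expression mik,mjk->mij discussed above:
# m -> batch, i -> freeA, j -> freeB, k -> contraction, no reduction subscripts.
print(classify_labels("mik", "mjk", "mij"))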
In the embodiment of the disclosure, the second Einstein operator can multiplex the forward-propagation result of the first Einstein operator by designing the marking rule of the Planner stage. For example, during the computation of the second Einstein operator, the kernel may need to perform the same operations as during the computation of the first Einstein operator: both the second Einstein operator and the first Einstein operator take the first tensor as an operand, and the transpose of the first tensor is required in the calculation of both Einstein operators. The marking rules can then be preset so that the first tensor in the second Einstein operator can multiplex the first tensor transpose result of the first Einstein operator. In this case, if the first tensor transpose result is required in the calculation of the second Einstein operator, it can be read directly from the target storage medium to satisfy the calculation requirement of the second Einstein operator. The first tensor transpose result is thus reused directly, the transpose kernel is not called repeatedly and the transpose operation is not re-executed, so that part of the intermediate calculation of the second Einstein operator is omitted.
Thus, S105 may be performed, reading the first tensor transpose result from the target storage medium in case the second einstein operator needs to multiplex the first tensor transpose result.
And S106, executing an operation process of the second Einstein operator based on the read transposed result of the first tensor and the subscript of the first tensor of the second Einstein operator.
In the related art, each time the transpose result of the first tensor is needed, a kernel must be launched and called to perform the transpose operation on the first tensor. In practice, however, the cost of calling a kernel to execute an operation is very high, and calling the kernel multiple times seriously affects computing efficiency and increases the hardware burden. In the embodiment of the disclosure, the first tensor transpose result of the first Einstein operator is multiplexed in the calculation process of the second Einstein operator, so that repeated kernel launches and calls are avoided, resource consumption is greatly reduced, and calculation efficiency is improved.
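The storing and multiplexing of S102/S103 and S105 can be pictured with the following sketch (purely illustrative: the cache structure, the helper name and numpy.transpose standing in for the transpose kernel are assumptions of this example, not the disclosure's implementation):

import numpy as np

_transpose_cache = {}                                 # plays the role of the target storage medium

def cached_transpose(name, tensor, axes):
    """Transpose `tensor` once and multiplex the stored result on later requests."""
    key = (name, axes)
    if key not in _transpose_cache:                   # first request: launch the transpose kernel once
        _transpose_cache[key] = np.transpose(tensor, axes)
    return _transpose_cache[key]                      # later requests reuse the cached result

A = np.random.rand(4, 2, 3, 5)                        # first tensor, subscript ibnd
TA = cached_transpose("A", A, (1, 2, 0, 3))           # ibnd -> bnid, kernel called once
TA_again = cached_transpose("A", A, (1, 2, 0, 3))     # multiplexed: no new kernel launch
assert TA is TA_again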
In some embodiments, the first einstein operator includes two operands therein, and the first tensor is any one of the two operands. That is, a first operand of a first einstein operator may be treated as a first tensor, and the transposed result of its first operand stored for multiplexing by a second einstein operator. The second operand of the first einstein operator may also be considered as the first tensor, and the transposed result of its second operand stored for multiplexing by the second einstein operator. The first operand and the second operand of the first einstein operator may also be respectively treated as a first tensor, and the transposed result of each of the first operand and the second operand is stored for multiplexing by the second einstein operator.
In the embodiment of the disclosure, any operand in the Einstein operator can serve as the first tensor, rather than restricting the first tensor to a specific operand, so that the target storage medium can be used reasonably as required. In addition, multiplexing the first tensor transpose result effectively reduces the number of kernel calls and thus the resource consumption. Moreover, in the model training stage, determining the first tensor as required makes the data processing more flexible and efficient.
For ease of understanding, the generation of preset marking rules in embodiments of the present disclosure, and the details of the rules, are described in detail below.
For example, in the reverse direction, the forward function may be used twice to perform the calculation. For example, when O=paddle.einsum("ibnd,jbnd->bnij", A, B) is computed in reverse, dA=paddle.einsum("ibnd,jbnd->bnij", dO, B) may be used, i.e., operand A is replaced with dO, the gradient of O taking the place of operand A, which corresponds to interchanging operands A and O. If multiplexing of the transposed results is desired, then among the three EinsumKernel substitutions, multiplexing can be achieved only if the transpose subscripts TA1, TA2 and TA3 are identical. The reason why multiplexing requires the subscripts of TA1, TA2 and TA3 to be identical is analyzed below, taking the following three forward calls as examples (the expressions are numbered 1), 2) and 3), respectively):
O=paddle.einsum(A,B) 1)
dA=paddle.einsum(B,dO) 2)
dB=paddle.einsum(A,dO) 3)
Wherein TA1, TA2, TA3 correspond to the transpose subscripts of operand A contained in expressions 1), 2), 3), respectively.
Because operand A does not participate in the calculation as an input operand in expression 2), expression 2) does not need to multiplex the transpose of operand A. However, operand A is an input operand in expression 3), so it is desirable that expression 3) be able to multiplex the transpose of operand A from expression 1). The subscript TA1 of operand A in expression 1) is: ABO, AO, AB, while the subscript TA3 of operand A in expression 3) is: ABO, AB, AO. It can be seen that TA1 and TA3 are not identical, so the transpose of operand A cannot be multiplexed. Therefore, the embodiments of the present disclosure modify the marking rules so that expression 3) can multiplex the transpose of operand A from expression 1).
In view of this, in the embodiment of the present disclosure, in a stage of planning and scheduling the first tensor of the second einstein operator, the subscript of the first tensor of the second einstein operator is generated according to a preset labeling rule, where the preset labeling rule includes:
(1) The same elements in the same label set of the first Einstein operator and the second Einstein operator have the same ordering.
Wherein the label sets are those shown in Table 1: ABO, AO, BO, AB, A, B.
For example, suppose operand A contains the elements "ibnd", operand B contains "jbnd", and O contains "bnij". The elements of the same set must keep the same order in both expression 1) and expression 3); for instance, the elements "bn" of the ABO set in expression 1) must not become "nb" in the ABO set of expression 3). Thus, embodiments of the present disclosure require that the same elements in the same label set keep the same order.
(2) The transpose subscripts of different operands follow different orders, namely:
where the first tensor is the first operand of the second einstein operator, the transposed index of the first operand satisfies the order of ABO, AO, AB. That is, the transpose of the first operand is obtained by sequentially concatenating the elements in the ABO, AO, AB set. For example, if the operand a is ibnd, the element included in the ABO set is bn, the element included in the AO set is i, and the element included in the AB set is d, the elements in the set ABO, AO, AB are sequentially spliced to obtain the transpose bnid of the operand a.
Where the first tensor is the second operand of the second Einstein operator, the transpose subscript of the second operand follows the order ABO, AB, BO. That is, when the operand is the second operand, its transpose is obtained by sequentially concatenating the elements of the ABO, AB and BO sets. For example, if operand B is jbnd, the ABO set contains bn, the AB set contains d, and the BO set contains j; concatenating the elements of the sets ABO, AB, BO in order gives the transpose bndj of operand B.
Wherein the meaning of the ABO, AO and BO sets is the same as the previous expression, e.g. ABO is the first set of labels, the elements in ABO are contained in two operands of the second einstein operator and in the output result of the second einstein operator.
AO is a second set of tokens, elements in AO being contained in a first operand of a second einstein operator and in an output result of the second einstein operator.
BO is a third set of tokens, and elements in BO are contained in a second operand of a second Einstein operator and in an output result of the second Einstein operator.
In the embodiment of the disclosure, by formulating the preset marking rule, the second Einstein operator satisfies the condition for multiplexing the first tensor transpose result of the first Einstein operator, which guarantees the reusability of the first tensor transpose result and provides the conditions for simplifying the operation of the second Einstein operator. Caching the transpose results in the storage medium for multiplexing also reduces the number of calls to the transpose kernel.
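A minimal sketch of rule (2) (illustrative only: the helper names are hypothetical and single-letter subscripts are assumed), deriving the transpose subscript of an operand from the label sets in the fixed orders ABO, AO, AB for the first operand and ABO, AB, BO for the second operand:

def label_sets(a_labels, b_labels, out_labels):
    a, b, o = set(a_labels), set(b_labels), set(out_labels)
    ABO = [c for c in a_labels if c in b and c in o]
    AO = [c for c in a_labels if c not in b and c in o]
    BO = [c for c in b_labels if c not in a and c in o]
    AB = [c for c in a_labels if c in b and c not in o]
    return ABO, AO, BO, AB

def transpose_subscript(position, a_labels, b_labels, out_labels):
    """Transpose subscript per the preset marking rule: ABO|AO|AB or ABO|AB|BO."""
    ABO, AO, BO, AB = label_sets(a_labels, b_labels, out_labels)
    parts = ABO + AO + AB if position == "first" else ABO + AB + BO
    return "".join(parts)

# Forward expression "ibnd,jbnd->bnij":
print(transpose_subscript("first", "ibnd", "jbnd", "bnij"))    # bnid (TA)
print(transpose_subscript("second", "ibnd", "jbnd", "bnij"))   # bndj (TB)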
In practice, all three transposes of the forward computation may be stored in the target storage medium. For example, the transpose TA of operand A and the transpose TB of operand B in expression 1), as well as the transpose TdO of the gradient dO of the output result, are stored for multiplexing.
In some implementations, where the first tensor is the first operand of the first Einstein operator and the second Einstein operator is used to determine the gradient of the second operand of the first Einstein operator, in order to multiplex as many transposes of the forward computation as possible, the embodiments of the present disclosure determine the expression of the second Einstein operator to be a first target expression, as shown in equation 4):
dO×A->dB 4)
in equation 4), dO represents the gradient of the output result of the first Einstein operator, A represents the first tensor, and dB represents the gradient of the second operand of the first Einstein operator.
Taking the read first tensor transposition result as transposition of A in a first target expression, taking the transposition of the gradient of the output result of a first Einstein operator read from a target storage medium as transposition of dO in the first target expression, and executing the first target expression based on the subscript of the first tensor of a second Einstein operator to obtain the gradient of the second operand of the first Einstein operator.
Thus, in the disclosed embodiments, in order to multiplex the transposes of the forward stage as much as possible, the first target expression is deliberately designed to solve for the gradient of the second operand of the first Einstein operator. Increasing the number of multiplexed transposes reduces the number of kernel calls and thus saves computing resources.
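A numeric sanity check of the first target expression dO×A->dB for the running example "ibnd,jbnd->bnij" (a sketch: numpy stands in for the framework kernels, and the intermediate layout bnjd below is specific to this sketch, while the text's worked example quotes bndj; the final transpose to the target layout is the point in either case):

import numpy as np

i, b, n, d, j = 2, 3, 4, 5, 6
A = np.random.rand(i, b, n, d)               # first operand, subscript ibnd
dO = np.random.rand(b, n, i, j)              # gradient of the output of "ibnd,jbnd->bnij"

# Reference: the first target expression dO x A -> dB written as one einsum call.
dB_ref = np.einsum("bnij,ibnd->jbnd", dO, A)

# Multiplexing view: reuse TA (the forward transpose ibnd -> bnid), transpose dO once,
# run one batched matmul, then transpose the intermediate gradient to the target layout.
TA = np.transpose(A, (1, 2, 0, 3))           # bnid, already available from the forward pass
TdO = np.transpose(dO, (0, 1, 3, 2))         # bnij -> bnji
dB_mid = np.matmul(TdO, TA)                  # (b,n,j,i) @ (b,n,i,d) -> bnjd
dB = np.transpose(dB_mid, (2, 0, 1, 3))      # bnjd -> jbnd

assert np.allclose(dB, dB_ref)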
Similarly, in the case where the first tensor is the second operand of the first Einstein operator and the second Einstein operator is used to determine the gradient of the first operand of the first Einstein operator, the expression of the second Einstein operator is determined to be the second target expression, which is the same as expression 2), that is: B×dO->dA, where dO represents the gradient of the output result of the first Einstein operator, B represents the first tensor, and dA represents the gradient of the first operand of the first Einstein operator.
Taking the read first tensor transposition result as transposition of B in a second target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as transposition of dO in the second target expression, and executing the second target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the first operand of the first Einstein operator.
Thus, in the disclosed embodiments, to multiplex the transposes of the forward phase as much as possible, the second target expression is employed to solve for the gradient of the first operand of the first Einstein operator. Increasing the number of multiplexed transposes reduces the number of kernel calls and thus saves computing resources.
To sum up, in order to multiplex the three transposes in expression 1) as much as possible, the embodiment of the present disclosure modifies expression 3) of the original calculation process into expression 4), so that the three transposes of expression 1) can be multiplexed in both of the backward-phase expressions 2) and 4).
In some embodiments, because of the transpose operations and the adoption of the preset marking rule, the finally determined gradient of an operand may not be directly suitable for subsequent computation. To this end, in embodiments of the present disclosure, the resulting gradient may be transposed into a target gradient available for subsequent computation through a transpose operation. As shown in fig. 2, this can be implemented as:
s201, in the case that the second einstein operator performs the inverse operation with respect to the first einstein operator, the intermediate gradient is stored in the target storage medium.
S202, reading the intermediate gradient from the target storage medium, and calling the kernel to transpose the intermediate gradient according to the subscript of the corresponding target operand to obtain the target gradient of the target operand.
Wherein:
in the case that the target operand is the first operand of the first einstein operator, the gradient of the first operand of the first einstein operator determined by the second einstein operator is an intermediate gradient;
In the case where the target operand is the second operand of the first einstein operator, the gradient of the second operand of the first einstein operator determined by the second einstein operator is an intermediate gradient.
For example, the subscript of the intermediate gradient dB calculated by multiplexing TdO and TA is bndj, while expression 1) shows that the subscript of the desired target gradient dB is jbnd; the subscript of the intermediate gradient is therefore converted into the subscript order of the target gradient by a transpose operation, that is, bndj is transposed to jbnd.
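The transpose of S202, using the subscript order bndj quoted above, is a single axis permutation (a sketch with an arbitrary shape):

import numpy as np

dB_mid = np.random.rand(3, 4, 5, 6)          # intermediate gradient laid out as bndj
dB = np.transpose(dB_mid, (3, 0, 1, 2))      # bndj -> jbnd, the target gradient layout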
In some embodiments, the target storage medium has stored therein at least a gradient of a first operand of a first einstein operator determined by a second einstein operator and a gradient of a second operand of the first einstein operator determined by the second einstein operator. When a transpose operation is performed on the intermediate gradient, the gradient of the first operand may be read from the target storage medium, or the gradient of the second operand may be read, or both the gradient of the first operand and the gradient of the second operand may be read.
In the embodiment of the disclosure, an intermediate gradient obtained after the second einstein operator performs inverse operation with respect to the first einstein operator is stored in a target storage medium, so that the intermediate gradient is multiplexed later and transposed into a target gradient, so that the accuracy of calculation is improved.
It should be noted that, whether or not the scheme of the embodiments of the present disclosure is adopted to multiplex the transposed results, the related art also needs to perform a transpose operation on the obtained dB and dA; therefore, in terms of the overall flow, the embodiments of the present disclosure do not increase the number of kernel calls.
In the disclosed embodiments, in addition to the case where the Einstein operator has two operands, the case where the Einstein operator has multiple operands is also considered. When the Einstein operator has multiple operands, the following method can be applied to achieve multiplexing of the transposed results:
in the case where the target einstein operator includes n operands in an ordered arrangement, the target einstein operator is disassembled into a plurality of first einstein operators for ordered execution based on the following method:
and determining two operands which are sequenced into a first position and a second position in the n operands as a first operand of a first Einstein operator corresponding to the operand in the second position, and obtaining an output result of the first Einstein operator corresponding to the operand in the second position.
In the case where there are unprocessed operands among the n operands, the first-ordered operand among the unprocessed operands is determined as the target operand. The output result of the first Einstein operator corresponding to the operand preceding the target operand and the target operand itself are respectively taken as the first operand and the second operand of the first Einstein operator corresponding to the target operand, so as to obtain the output result of the first Einstein operator corresponding to the target operand.
In some embodiments, for example, the target einstein operator has three operands, D, E, F respectively, where two operands D and E ordered into a first position and a second position are used as two operands of the first einstein operator corresponding to the operand E in the second position, so as to obtain the output result of the first einstein operator corresponding to the operand E in the second position.
At this time, there is still an unprocessed operand F among the three operands D, E, F, and F is the target operand because it is the only unprocessed operand remaining. The operand preceding the target operand F is E. The output result of the first Einstein operator corresponding to E and the target operand F are then respectively taken as the first operand and the second operand of the first Einstein operator corresponding to the target operand F, and the calculation continues according to the processing procedure for two operands described above, so as to obtain the output result of the first Einstein operator corresponding to the target operand F; the loop ends when no unprocessed operands remain.
In the embodiment of the disclosure, when an Einstein operator has multiple operands, the complex Einstein operator can be decomposed into Einstein operators having only two operands, and the method of multiplexing transposed results in the embodiments of the present disclosure is then executed, which reduces the number of kernel calls, saves computing resources, realizes multiplexing of the output results of the Einstein operators, and improves the operation efficiency of the Einstein operator.
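A sketch of the pairwise decomposition described above (illustrative: numpy.einsum stands in for the two-operand Einstein operator, and the equations and shapes are assumptions of this example):

import numpy as np

def chained_einsum(pair_equations, operands):
    """Fold n operands left to right; each step's first operand is the previous output."""
    result = operands[0]
    for equation, operand in zip(pair_equations, operands[1:]):
        result = np.einsum(equation, result, operand)
    return result

# Three operands D, E, F: D and E feed the two-operand operator corresponding to E,
# then that output and F feed the two-operand operator corresponding to F.
D, E, F = np.ones((2, 3)), np.ones((3, 4)), np.ones((4, 5))
out = chained_einsum(["ij,jk->ik", "ik,kl->il"], [D, E, F])
assert np.allclose(out, D @ E @ F)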
To facilitate understanding of the embodiments of the present disclosure, an overall flow of the data processing method of the embodiments of the present disclosure is described with reference to fig. 3 a:
as shown in fig. 3a, a and B are input as two operands, the output C obtained after the first calculation of A, B is derived and dC is also the input number. And (3) performing first transposition on A, B and dC to obtain TA, TB and TdC in the middle layer, wherein the TA, TB and TdC can be multiplexed. Here, 3 tranpsos can be reduced. At the time of reversal, it is not necessary to repeatedly calculate TA, TB and TC. And continuing to perform BMM (BatchedMatMul, matrix multiplication batch processing) on TA, TB, tdC to obtain intermediate output results C, dB and dA. The intermediate output result at this time cannot be directly output and used, but needs to be transposed again to obtain transposed C, dB, dA as the final available output result.
For the specific calculation process, take the example shown in fig. 3b (note that the table header in fig. 3b represents label sets, not specific operands):
if the forward direction is calculated as: o=pad.einsum ("ibnd, jbind- > bnij", a, B), since a is located in the first operand, when the forward index of the two operands is required to be inconsistent, the transpose of the first operand is: an ABO set, an AO set, an AB set; the transpose of the second operand is: ABO set, AB set, BO set. The subscript for TA1 is given by ABO|AO|AB- > bnid: bnid.
The reverse calculation is dB=paddle.einsum("bnij,ibnd->jbnd", dO, A): since operand A is now the second operand, its transpose follows the ABO set, AB set, BO set, and A, B and O in the table header now correspond respectively to: A->dO, B->A, O->dB. The transpose of operand A is obtained as ABO|AB|BO -> bn|i|d -> bnid, so TA2 = bnid at this time.
In summary, by considering the forward calculation and the backward calculation together and adjusting the subscripts of the corresponding operands in the backward calculation according to the marking rule, TA1 = TA2 is finally achieved, which satisfies the multiplexing condition.
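Applying the transpose_subscript helper sketched earlier (same assumptions) to the forward call and the backward call of fig. 3b reproduces this conclusion:

# Forward: O = einsum("ibnd,jbnd->bnij", A, B); A is the first operand.
TA1 = transpose_subscript("first", "ibnd", "jbnd", "bnij")     # 'bnid'

# Backward: dB = einsum("bnij,ibnd->jbnd", dO, A); A is now the second operand,
# with the header roles A->dO, B->A, O->dB.
TA2 = transpose_subscript("second", "bnij", "ibnd", "jbnd")    # 'bnid'

assert TA1 == TA2 == "bnid"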
After the above optimization method is used, it can be guaranteed that multiplexing TA, TB and TdC in the reverse pass is always correct. The reverse speed of the Einstein operator can thus be brought to nearly the same level as the Trace mode. Experiments show that the optimized reverse pass takes 35 ms, compared with 40 ms for the reverse pass of the original combination mode that does not multiplex transposed results, an improvement of 16%.
In conclusion, experiments prove that by optimizing the Einstein operator back-propagation process, the embodiments of the present disclosure can greatly improve the reverse calculation speed while consuming the same amount of video memory. Meanwhile, in the theoretically optimal case, the back-propagation process of the Einstein operator can be reduced by 6 kernel operations, including 3 Transpose operations and 3 ReduceSum operations.
Experiments also show that the number of Transposes can be reduced to a very low level. Let A, B, O be the 2 inputs and 1 output, and dO the gradient of the output. Since the three matrices O, dA and dB need to be obtained, the kernel computation count is as follows: the theoretical minimum in the forward direction is 1 Matmul + 2 Transposes (TA and TB obtained from A and B). In the reverse direction there are 2 Matmul operations and one Transpose of dO. Finally, 3 Transposes are added for the outputs of fig. 3a. In total this is 3 input Transposes + 3 output Transposes + 3 Matmuls. It should be noted that this value is theoretical; even in the worst case, the solution of the embodiments of the present disclosure still has an advantage, because this count is, at best, the total number of kernel calls of the best planning algorithm in the related art. In addition, there are other situations, for example when an output or an input does not need a Transpose, in which the method of the embodiments of the present disclosure can further reduce the number of kernel calls. In summary, in the embodiments of the present disclosure, each variable theoretically only needs to be transposed once, while the same variable in the related art may be transposed multiple times. If Reduction and Broadcast are also taken into account, the result of Reduction+Transpose and the result of Broadcast+Transpose can likewise be saved, so reducing the number of kernel calls through multiplexing is also applicable when Reduction and Broadcast are involved.
Taking the field of natural language processing as an example, the data processing method of the embodiment of the present disclosure will be described. As shown in fig. 4:
FIG. 4 shows the XLNet (Generalized Autoregressive Pretraining based on Transformer-XL, a Transformer-XL extension) model, with the einsum operator applied in the Relative Attention of XLNet. A specific application is in the two Masked Two-stream Attention modules in fig. 4. Specifically, the einsum operator combines multiple complex linear operations on the embedding tensor of the text input and finally outputs the attention weights (output attentions).
Taking the field of protein structure prediction as an example, the data processing method of the embodiment of the present disclosure is described. For example, in the gray area of fig. 5, the model includes: a PLM (Protein Language Model), an adapter layer and a geometric model. Based on this protein structure prediction model, predicting the structure of a candidate protein can be implemented as shown in the gray region of fig. 5:
Constructing primary structure information and an attention map of the candidate protein based on the protein language model in the protein structure prediction model; the disclosed examples used 300 million single sequences (the ~300M primary sequences in the gray region of FIG. 5) to train the PLM model, so that the PLM can accurately extract primary structure information and attention maps.
The primary structure information and attention map of the candidate protein are input into the adapter layer of the protein structure prediction model to obtain the secondary structure information of the candidate protein. As shown in the gray area of fig. 5, the dashed box behind the adapter layer shows the secondary structure information, including single sequence representation 1 (single repr.) and pairing representation 1 (pair repr.). The adapter layer may comprise two linear layers, where inputting the primary structure information into one of the linear layers yields single sequence representation 1 in the secondary structure information, and inputting the attention map into the other linear layer yields pairing representation 1 in the secondary structure information.
And inputting the secondary structure information of the candidate protein into a geometric model to obtain the tertiary structure information of the candidate protein. As shown in the gray area of fig. 5, the geometric model may be Geometric Modeling (geometric model) of the AlphaFold model. So that the structure of the candidate protein can be accurately predicted using the structure prediction ability of the geometric model.
The original EvoFormer in the AlphaFold model uses the searched MSA (Multiple Sequence Alignment) as input. In contrast, in the embodiment of the present disclosure, the output of the adapter layer is used as the MSA, so that the process of searching for the MSA is omitted and the prediction speed is improved. Second, the EvoFormer in the disclosed embodiments employs various attention mechanisms to exchange information between the single-sequence representation and the pairing representation in order to learn spatial relationships.
In the disclosed embodiment, the Structure Module adopts the single-sequence representation and pairing representation generated by EvoFormer, and uses invariant point attention and other geometric transformation operators to predict, end to end, the 3D coordinates of the atoms in the structure.
The disclosed embodiments train the PLM model with 300 million single sequences (the ~300M primary sequences in the gray area of fig. 5). Since relying solely on PLM-predicted structures is not sufficient to adequately capture the required feature information, the PLMBase (i.e., PLM) and Geometric Modeling modules in the protein structure prediction model (HelixFold-Single) are jointly optimized. Optimization training was performed using about 100 thousand experimentally determined protein structures (the ~120K determined structures in the gray area of fig. 5), together with an additional approximately one million estimated protein structures (the ~1M estimated structures in the gray area of fig. 5). The network is trained end-to-end with primary losses, including the Frame Aligned Point Error (FAPE) loss, and other secondary losses. By combining a computationally efficient PLMBase module (as compared with MSA search) with the Geometric Modeling module, HelixFold-Single can provide efficient and accurate protein structure prediction.
When Einstein operators are used in the protein structure prediction model shown in the gray region of fig. 5, the input represents the sequence of the protein, i.e., the operands of the Einstein operators are protein sequences, and by combining multiple einsum operators the attention weights (i.e., the attention map in the gray region of fig. 5) are output.
In the field of protein structure prediction, multiplexing the transposes of the forward-propagated protein sequences by the method of the embodiments of the present disclosure reduces the number of kernel launches and calls, saves computing resources, and improves the model pre-training efficiency.
Taking the field of machine vision as an example, the data processing method of an embodiment of the present disclosure is described. For example, in automatic driving control, an image of the surrounding environment of the vehicle is acquired through an image acquisition device of the vehicle, and features of the surrounding environment are extracted from the image through a feature extraction module in a visual model. When automatic driving control is implemented, the extracted features can be used as operands of Einstein operators, and the transposes of the features can be cached, so that in the subsequent process of adjusting the model parameters of the visual model, the cached transpose results are multiplexed during back propagation when solving for the gradients of the model parameters.
In the field of machine vision, multiplexing the transposes of the forward-propagated image features by the method of the embodiments of the present disclosure reduces the number of kernel calls, saves computing resources, and improves the model pre-training efficiency.
It should be noted that, multiplexing the buffered transposed results when solving the gradient of the model parameters in the back propagation is not only applicable to the machine vision field, but also applicable to other fields. Such as the fields of natural language processing and protein structure prediction.
Based on the same technical concept, the embodiments of the present disclosure further provide a data processing apparatus, as shown in fig. 6, including:
a first acquisition module for acquiring a first tensor required by a first einstein operator;
the calling module is used for calling the kernel to perform transposition operation on the first tensor to obtain a transposition result of the first tensor;
a first storage module for storing the first tensor transpose result into a target storage medium;
the generation module is used for generating subscripts of the first tensor of the second Einstein operator according to a preset marking rule in the stage of planning and scheduling the first tensor of the second Einstein operator; the preset marking rule meets the requirement of multiplexing the first tensor transposition result;
the reading module is used for reading the first tensor transposition result from the target storage medium under the condition that the second einstein operator needs to multiplex the first tensor transposition result;
and the execution module is used for executing the operation process of the second Einstein operator based on the read transposed result of the first tensor and the subscript of the first tensor of the second Einstein operator.
In some embodiments, the preset labeling rules include:
the ordering order of the same elements in the same label set of the first Einstein operator and the second Einstein operator is the same;
in the case where the first tensor is the first operand of the second einstein operator, the transposed index of the first operand satisfies the order of ABO, AO, AB;
where the first tensor is the second operand of the second einstein operator, the transposed index of the second operand satisfies the order of ABO, AB, BO;
wherein ABO is a first set of labels, elements in ABO being contained in two operands of a second einstein operator and in an output result of the second einstein operator;
AO is a second set of tokens, elements in AO being contained in a first operand of a second einstein operator and in an output result of the second einstein operator;
BO is a third set of tokens, and elements in BO are contained in a second operand of a second Einstein operator and in an output result of the second Einstein operator.
In some embodiments, the execution module is to:
in the case where the first tensor is a first operand of a first einstein operator and the second einstein operator is used to determine a gradient of a second operand of the first einstein operator, determining an expression of the second einstein operator as a first target expression, the first target expression being: dO x A- > dB, wherein dO represents the gradient of the output result of the first Einstein operator, A represents the first tensor, dB represents the gradient of the second operand of the first Einstein operator;
Taking the read first tensor transposition result as transposition of A in a first target expression, taking the transposition of the gradient of the output result of a first Einstein operator read from a target storage medium as transposition of dO in the first target expression, and executing the first target expression based on the subscript of the first tensor of a second Einstein operator to obtain the gradient of the second operand of the first Einstein operator.
In some embodiments, the execution module is further to:
in the case where the first tensor is the second operand of the first Einstein operator and the second Einstein operator is used to determine the gradient of the first operand of the first Einstein operator, determining the expression of the second Einstein operator as a second target expression, the second target expression being: B×dO->dA, where dO represents the gradient of the output result of the first Einstein operator, B represents the first tensor, and dA represents the gradient of the first operand of the first Einstein operator;
taking the read first tensor transposition result as transposition of B in a second target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as transposition of dO in the second target expression, and executing the second target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the first operand of the first Einstein operator.
In some embodiments, the data processing apparatus further comprises:
the second storage module is used for storing the intermediate gradient into the target storage medium under the condition that the second Einstein operator performs reverse operation relative to the first Einstein operator;
the second acquisition module is used for reading the intermediate gradient from the target storage medium, and calling the kernel to transpose the subscript of the target gradient according to the subscript of the corresponding target operand to obtain the target gradient of the target operand;
wherein:
in the case that the target operand is the first operand of the first einstein operator, the gradient of the first operand of the first einstein operator determined by the second einstein operator is an intermediate gradient;
in the case where the target operand is the second operand of the first einstein operator, the gradient of the second operand of the first einstein operator determined by the second einstein operator is an intermediate gradient.
In some embodiments, the first Einstein operator has two operands, and the apparatus further comprises a splitting module configured to:
in the case where the target einstein operator includes n operands in an ordered arrangement, the target einstein operator is disassembled into a plurality of first einstein operators for ordered execution based on the following method:
determine the two operands ordered in the first position and the second position among the n operands as the first operand and the second operand of the first Einstein operator corresponding to the operand in the second position, and obtain the output result of the first Einstein operator corresponding to the operand in the second position;
determine, in the case that unprocessed operands exist among the n operands, the first-ordered operand among the unprocessed operands as the target operand; and,
and respectively taking the output result of the first Einstein operator corresponding to the last operand of the target operand and the target operand as the first operand and the second operand of the first Einstein operator corresponding to the target operand to obtain the output result of the first Einstein operator corresponding to the target operand.
In some embodiments, the first einstein operator includes two operands therein, and the first tensor is any one of the two operands.
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the user's personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When a computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the data processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved, which is not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A data processing method, comprising:
acquiring a first tensor required by a first einstein operator;
calling an inner core to execute transposition operation on the first tensor to obtain a transposition result of the first tensor;
storing the first tensor transpose result into a target storage medium;
generating a subscript of the first tensor of the second Einstein operator according to a preset labeling rule in a stage of planning and scheduling the first tensor of the second Einstein operator; wherein the preset labeling rule satisfies a requirement that the first tensor transpose result can be multiplexed, and the preset labeling rule includes:
The ordering order of the same elements in the same label set of the first Einstein operator and the second Einstein operator is the same;
in the case that the first tensor is the first operand of the second einstein operator, the transposed index of the first operand satisfies the order of ABO, AO, AB;
in the case that the first tensor is the second operand of the second einstein operator, the transposed index of the second operand satisfies the order of ABO, AB, BO:
wherein ABO is a first set of labels, elements in ABO being contained in two operands of the second einstein operator and in an output result of the second einstein operator;
AO is a second set of tokens, elements in AO being contained in a first operand of the second einstein operator and in an output result of the second einstein operator;
BO is a third set of labels, elements in BO are contained in a second operand of the second Einstein operator and in an output result of the second Einstein operator;
reading the first tensor transpose result from the target storage medium if the second einstein operator needs to multiplex the first tensor transpose result;
And executing an operation process of the second Einstein operator based on the read transposed result of the first tensor and the subscript of the first tensor of the second Einstein operator.
2. The method of claim 1, wherein a transpose of the gradient of the output result of the first Einstein operator is also stored in the target storage medium, and wherein performing the operation of the second Einstein operator based on the read transpose result of the first tensor and the subscript of the first tensor of the second Einstein operator comprises:
in the case where the first tensor is the first operand of the first Einstein operator and the second Einstein operator is used to determine a gradient of the second operand of the first Einstein operator, determining an expression of the second Einstein operator as a first target expression, the first target expression being: dO x A -> dB, wherein dO represents the gradient of the output result of the first Einstein operator, A represents the first tensor, and dB represents the gradient of the second operand of the first Einstein operator;
taking the read first tensor transposition result as transposition of A in the first target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as transposition of dO in the first target expression, and executing the first target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the second operand of the first Einstein operator.
3. The method of claim 1, wherein a transpose of the gradient of the output result of the first Einstein operator is also stored in the target storage medium, and wherein performing the operation of the second Einstein operator based on the read transpose result of the first tensor and the subscript of the first tensor of the second Einstein operator comprises:
in the case where the first tensor is the second operand of the first Einstein operator and the second Einstein operator is used to determine a gradient of the first operand of the first Einstein operator, determining an expression of the second Einstein operator as a second target expression, the second target expression being: B x dO -> dA, wherein dO represents the gradient of the output result of the first Einstein operator, B represents the first tensor, and dA represents the gradient of the first operand of the first Einstein operator;
taking the read transposed result of the first tensor as a transpose of B in the second target expression, taking a transpose of a gradient of an output result of a first Einstein operator read from the target storage medium as a transpose of dO in the second target expression, and executing the second target expression based on a subscript of the first tensor of the second Einstein operator to obtain a gradient of a first operand of the first Einstein operator.
4. The method of claim 1, further comprising:
storing an intermediate gradient into the target storage medium in the case that the second Einstein operator performs a reverse operation with respect to the first Einstein operator;
reading the intermediate gradient from the target storage medium, and calling the kernel to transpose the intermediate gradient according to the subscript of the corresponding target operand to obtain a target gradient of the target operand;
wherein:
in the case that the target operand is the first operand of the first einstein operator, the gradient of the first operand of the first einstein operator determined by the second einstein operator is the intermediate gradient;
in the case that the target operand is a second operand of a first einstein operator, the gradient of the second operand of the first einstein operator determined by the second einstein operator is the intermediate gradient.
5. The method of claim 1, wherein the first einstein operator has two operands, further comprising:
in the case that the target einstein operator comprises an ordered arrangement of n operands, the target einstein operator is disassembled into a plurality of sequentially executed first einstein operators based on the following method:
determining the two operands ranked first and second among the n operands as the first operand and the second operand, respectively, of the first Einstein operator corresponding to the second-ranked operand, and obtaining an output result of the first Einstein operator corresponding to the second-ranked operand;
in the case that unprocessed operands exist among the n operands, determining the operand ranked first among the unprocessed operands as a target operand; and
taking the output result of the first Einstein operator corresponding to the operand preceding the target operand and the target operand respectively as the first operand and the second operand of the first Einstein operator corresponding to the target operand, to obtain the output result of the first Einstein operator corresponding to the target operand.
6. The method of any of claims 1-5, wherein two operands are included in the first einstein operator, the first tensor being any one of the two operands.
7. A data processing apparatus comprising:
a first acquisition module for acquiring a first tensor required by a first einstein operator;
the calling module is used for calling the kernel to perform transposition operation on the first tensor to obtain a first tensor transposition result;
A first storage module configured to store the first tensor transpose result into a target storage medium;
the generation module is used for generating subscripts of the first tensor of the second Einstein operator according to a preset labeling rule in the stage of planning and scheduling the first tensor of the second Einstein operator; wherein the preset labeling rule satisfies a requirement that the first tensor transpose result can be multiplexed, and the preset labeling rule includes:
the ordering order of the same elements in the same label set of the first Einstein operator and the second Einstein operator is the same;
in the case that the first tensor is the first operand of the second einstein operator, the transposed index of the first operand satisfies the order of ABO, AO, AB;
in the case that the first tensor is the second operand of the second einstein operator, the transposed index of the second operand satisfies the order of ABO, AB, BO:
wherein ABO is a first set of labels, elements in ABO being contained in two operands of the second einstein operator and in an output result of the second einstein operator;
AO is a second set of tokens, elements in AO being contained in a first operand of the second einstein operator and in an output result of the second einstein operator;
BO is a third set of labels, elements in BO are contained in a second operand of the second Einstein operator and in an output result of the second Einstein operator;
a reading module, configured to read the first tensor transpose result from the target storage medium when the second einstein operator needs to multiplex the first tensor transpose result;
and the execution module is used for executing the operation process of the second Einstein operator based on the read transposed result of the first tensor and the subscript of the first tensor of the second Einstein operator.
8. The apparatus of claim 7, wherein a transpose of the gradient of the output result of the first einstein operator is also stored into the target storage medium, wherein the execution module is to:
in the case where the first tensor is a first operand of the first einstein operator and the second einstein operator is used to determine a gradient of a second operand of the first einstein operator, determining an expression of the second einstein operator as a first target expression, the first target expression being: dO x A -> dB, wherein dO represents the gradient of the output result of the first Einstein operator, A represents the first tensor, dB represents the gradient of the second operand of the first Einstein operator;
Taking the read first tensor transposition result as transposition of A in the first target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as transposition of dO in the first target expression, and executing the first target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the second operand of the first Einstein operator.
9. The apparatus of claim 7, wherein a transpose of the gradient of the output result of the first einstein operator is also stored into the target storage medium, wherein the execution module is to:
in the case where the first tensor is a second operand of the first einstein operator and the second einstein operator is used to determine a gradient of the first operand of the first einstein operator, determining an expression of the second einstein operator as a second target expression, the second target expression being: B x dO -> dA, wherein dO represents the gradient of the output result of the first einstein operator, B represents the first tensor, dA represents the gradient of the first operand of the first einstein operator;
Taking the read transposed result of the first tensor as a transpose of B in the second target expression, taking a transpose of a gradient of an output result of a first Einstein operator read from the target storage medium as a transpose of dO in the second target expression, and executing the second target expression based on a subscript of the first tensor of the second Einstein operator to obtain a gradient of a first operand of the first Einstein operator.
10. The apparatus of claim 7, further comprising:
a second storage module for storing an intermediate gradient into the target storage medium in the case of a reverse operation of the second einstein operator with respect to the first einstein operator;
the second acquisition module is used for reading the intermediate gradient from the target storage medium, and calling the kernel to transpose the intermediate gradient according to the subscript of the corresponding target operand to obtain the target gradient of the target operand;
wherein:
in the case that the target operand is the first operand of the first einstein operator, the gradient of the first operand of the first einstein operator determined by the second einstein operator is the intermediate gradient;
In the case that the target operand is a second operand of a first einstein operator, the gradient of the second operand of the first einstein operator determined by the second einstein operator is the intermediate gradient.
11. The apparatus of claim 7, wherein the first einstein operator has two operands, further comprising a splitting module to:
in the case that the target einstein operator comprises an ordered arrangement of n operands, the target einstein operator is disassembled into a plurality of sequentially executed first einstein operators based on the following method:
determining the two operands ranked first and second among the n operands as the first operand and the second operand, respectively, of the first Einstein operator corresponding to the second-ranked operand, and obtaining an output result of the first Einstein operator corresponding to the second-ranked operand;
in the case that unprocessed operands exist among the n operands, determining the operand ranked first among the unprocessed operands as a target operand; and
taking the output result of the first Einstein operator corresponding to the operand preceding the target operand and the target operand respectively as the first operand and the second operand of the first Einstein operator corresponding to the target operand, to obtain the output result of the first Einstein operator corresponding to the target operand.
12. The apparatus of any of claims 7-11, wherein two operands are included in the first einstein operator, the first tensor being any one of the two operands.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202211488824.9A 2022-11-25 2022-11-25 Data processing method, device, electronic equipment and storage medium Active CN115759294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211488824.9A CN115759294B (en) 2022-11-25 2022-11-25 Data processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211488824.9A CN115759294B (en) 2022-11-25 2022-11-25 Data processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115759294A CN115759294A (en) 2023-03-07
CN115759294B true CN115759294B (en) 2023-10-24

Family

ID=85337922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211488824.9A Active CN115759294B (en) 2022-11-25 2022-11-25 Data processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115759294B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11170294B2 (en) * 2016-01-07 2021-11-09 Intel Corporation Hardware accelerated machine learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299725A (en) * 2018-07-27 2019-02-01 华中科技大学鄂州工业技术研究院 A kind of forecasting system and device based on the decomposition of tensor chain Parallel Implementation high-order dominant eigenvalue
CN114008586A (en) * 2019-06-27 2022-02-01 亚马逊技术股份有限公司 Transpose operation using an array of processing elements
JP2021018677A (en) * 2019-07-22 2021-02-15 株式会社Preferred Networks Information processing system, method for generating neural network structure and information processing program
CN114556372A (en) * 2019-09-03 2022-05-27 辉达公司 Processor and system for transforming tensor operations in machine learning
CN114201242A (en) * 2021-12-10 2022-03-18 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing data
CN114724254A (en) * 2022-05-16 2022-07-08 北京百度网讯科技有限公司 Method, device, equipment, storage medium and program product for determining action category
CN115081607A (en) * 2022-05-19 2022-09-20 北京百度网讯科技有限公司 Reverse calculation method, device and equipment based on embedded operator and storage medium
CN115169541A (en) * 2022-08-17 2022-10-11 无锡江南计算技术研究所 Tensor, vector and scalar calculation acceleration and data scheduling system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Fast evaluation of finite element weak forms using python tensor contraction packages; Robert Cimrman; Advances in Engineering Software; 1-26 *
FPGA-based tensor decomposition computing unit and its application in face recognition; Zhou Qi; China Master's Theses Full-text Database; 1-68 *
Design and implementation of batched matrix multiplication for deep learning; Huang Chun et al.; Chinese Journal of Computers; Vol. 45, No. 2; 225-239 *

Also Published As

Publication number Publication date
CN115759294A (en) 2023-03-07

Similar Documents

Publication Publication Date Title
CN115456159A (en) Data processing method and data processing equipment
EP3564863B1 (en) Apparatus for executing lstm neural network operation, and operational method
CN114911465B (en) Method, device and equipment for generating operator and storage medium
CN114020950B (en) Training method, device, equipment and storage medium for image retrieval model
US11651198B2 (en) Data processing method and apparatus for neural network
CN113407610B (en) Information extraction method, information extraction device, electronic equipment and readable storage medium
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN114092759A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN113538235A (en) Training method and device of image processing model, electronic equipment and storage medium
CN117114063A (en) Method for training a generative large language model and for processing image tasks
Han et al. LIANet: Layer interactive attention network for RGB-D salient object detection
CN114201242A (en) Method, apparatus, device and storage medium for processing data
JP2022003544A (en) Method for increasing field text, related device, and computer program product
CN115759294B (en) Data processing method, device, electronic equipment and storage medium
CN116468112B (en) Training method and device of target detection model, electronic equipment and storage medium
CN116151374B (en) Distributed model reasoning method, device, equipment, storage medium and program product
CN115809688B (en) Model debugging method and device, electronic equipment and storage medium
CN113360672B (en) Method, apparatus, device, medium and product for generating knowledge graph
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium
CN114792097A (en) Method and device for determining prompt vector of pre-training model and electronic equipment
CN113554550B (en) Training method and device for image processing model, electronic equipment and storage medium
CN109977011A (en) Automatic generation method, device, storage medium and the electronic equipment of test script
CN115457365A (en) Model interpretation method and device, electronic equipment and storage medium
US20210303802A1 (en) Program storage medium, information processing apparatus and method for encoding sentence
CN111709784B (en) Method, apparatus, device and medium for generating user retention time

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant