CN115759294A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN115759294A
CN115759294A (Application CN202211488824.9A)
Authority
CN
China
Prior art keywords
einstein
operator
operand
target
tensor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211488824.9A
Other languages
Chinese (zh)
Other versions
CN115759294B (en)
Inventor
熊昆
张留杰
刘红雨
蓝翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211488824.9A priority Critical patent/CN115759294B/en
Publication of CN115759294A publication Critical patent/CN115759294A/en
Application granted granted Critical
Publication of CN115759294B publication Critical patent/CN115759294B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The present disclosure provides a data processing method, an apparatus, an electronic device and a storage medium, which relate to the technical field of artificial intelligence, and in particular to the technical fields of machine learning, natural language processing, computer vision, protein structure prediction, and the like. The specific implementation scheme is as follows: acquiring a first tensor required by a first Einstein operator; invoking a kernel to perform transposition on the first tensor to obtain a first tensor transposition result and storing the first tensor transposition result in a target storage medium; generating subscripts of the first tensor of a second Einstein operator according to a preset marking rule in the planning and scheduling stage; in the case that the first tensor transposition result needs to be multiplexed, reading the first tensor transposition result from the target storage medium; and performing the operation of the second Einstein operator based on the read first tensor transposition result and the subscripts of the first tensor. In the embodiments of the disclosure, by multiplexing the first tensor transposition result of the first Einstein operator, starting and calling the kernel many times is avoided, resource consumption is reduced, and operation efficiency is improved.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the field of machine learning, natural language processing, computer vision, and protein structure prediction.
Background
Einsum (Einstein summation convention), also known as Einstein notation or the Einstein operator, can conveniently represent various linear operations.
In deep learning tasks, linear operations are widely used, for example in a Linear layer or a Matmul (matrix multiplication) layer. The Einstein operator is therefore very important and can greatly accelerate a user's model design and implementation.
In the related art, the operation of the einstein operator needs to call a kernel supporting the corresponding operator, but the startup and the call of the kernel consume a large amount of computing resources and time.
Disclosure of Invention
The disclosure provides a data processing method, a data processing device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a data processing method including:
acquiring a first tensor required by a first Einstein operator;
calling a kernel to perform a transposition operation on the first tensor to obtain a first tensor transposition result;
storing the first tensor transposition result into a target storage medium;
generating subscripts of the first tensor of the second Einstein operator according to a preset marking rule at the stage of planning and scheduling the first tensor of the second Einstein operator; the preset marking rule meets the requirement of multiplexing the first tensor transposition result;
under the condition that the second Einstein operator needs to multiplex the first tensor transposition result, reading the first tensor transposition result from the target storage medium;
and executing the operation process of the second Einstein operator based on the read first tensor transposition result and the subscript of the first tensor of the second Einstein operator.
According to another aspect of the present disclosure, there is provided a data processing apparatus including:
the first acquisition module is used for acquiring a first tensor required by a first Einstein operator;
the calling module is used for calling the kernel to execute a transposition operation on the first tensor to obtain a first tensor transposition result;
the first storage module is used for storing the first tensor transposition result into a target storage medium;
the generation module is used for generating subscripts of the first tensor of the second Einstein operator according to a preset marking rule at the stage of planning and scheduling the first tensor of the second Einstein operator; the preset marking rule meets the requirement of multiplexing the first tensor transposition result;
the reading module is used for reading the first tensor transposition result from the target storage medium under the condition that the second Einstein operator needs to multiplex the first tensor transposition result;
and the execution module is used for executing the operation process of the second Einstein operator based on the read first tensor transposition result and the subscript of the first tensor of the second Einstein operator.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of the embodiments of the present disclosure.
In the embodiment of the disclosure, the first tensor transposition result of the first Einstein operator is multiplexed in the calculation process of the second Einstein operator, so that starting and calling the kernel many times is avoided, thereby saving resource consumption and improving the operation efficiency.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow diagram of transposing a resulting gradient into a target gradient according to another embodiment of the present disclosure;
FIG. 3a is a schematic diagram of one example of a data processing method according to another embodiment of the present disclosure;
FIG. 3b is a schematic diagram of another example of a data processing method according to another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an application of a data processing method in the field of natural language processing according to another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an application of a data processing method according to another embodiment of the present disclosure in the field of protein structure prediction;
FIG. 6 is a schematic block diagram of a data processing apparatus according to another embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing a data processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terms "first," "second," and the like in the embodiments of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises" and "comprising," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such as a list of steps or elements. The methods, systems, articles, or apparatus need not be limited to the explicitly listed steps or elements, but may include other steps or elements not expressly listed or inherent to such processes, methods, articles, or apparatus.
The Einstein operator can conveniently represent various linear operations. For example, ij->ji can represent transposition, and ij,jk->ik can represent matrix multiplication; outer products, inner products, and the like can also be represented. Expressing all linear operations through a uniform Einstein summation greatly relieves the user's memorization burden.
In a deep learning framework, the Einstein operator based on two operands mainly accepts 3 parameters: 2 input operands (e.g., tensor A and tensor B) and a string computation expression (the Equation), and finally outputs a tensor C, for example in an expression of the form C = paddle.einsum(Equation, A, B).
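The following minimal sketch is a hedged illustration of this notation using numpy.einsum, which accepts the same Equation syntax; the shapes are arbitrary examples and not taken from the disclosure:

```python
import numpy as np

A = np.random.rand(3, 4)     # tensor A, subscripts "ij"
B = np.random.rand(4, 5)     # tensor B, subscripts "jk"

AT = np.einsum("ij->ji", A)        # "ij->ji": transposition, shape (4, 3)
C = np.einsum("ij,jk->ik", A, B)   # "ij,jk->ik": matrix multiplication, shape (3, 5)

# Two input operands plus one string Equation produce one output tensor C.
assert np.allclose(C, A @ B)
```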
In general, the forward implementation of almost all Einstein operators can be logically divided into two parts: the Planner (planning and scheduling stage) and the Executor (execution stage). The Planner part plans, based on the Equation, how to combine related operations such as Transpose, Matmul, Sum, and so on. A large Einstein operator is thereby broken down into many small computations, which are executed in a certain order to obtain the final result.
There are two main implementations of the einstein operator at present. One is a Trace (retrospective) mode, and the other is a combination mode.
The combination mode usually implements an efficient forward function on the C++ side, called EinsumKernel (Einsum kernel), and calls an EinsumGradKernel (Einsum gradient kernel) function when gradient computation is needed. The EinsumGradKernel function obtains the derivatives with respect to the 2 input operands (e.g., tensor A and tensor B) by calling EinsumKernel 2 times. Implementing the Einstein operator in the combination mode achieves good modularization, allows the backward pass to share the forward logic, is simple to implement, and facilitates subsequent optimization.
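A hedged Python sketch of this structure follows; the function names einsum_kernel and einsum_grad_kernel are illustrative stand-ins for the C++ EinsumKernel/EinsumGradKernel, numpy.einsum stands in for the actual kernel, and broadcast and reduction cases are ignored for brevity:

```python
import numpy as np

def einsum_kernel(equation, x, y):
    # Illustrative forward kernel: internally the framework combines
    # Transpose / Matmul / Sum; here numpy.einsum stands in for that.
    return np.einsum(equation, x, y)

def einsum_grad_kernel(equation, a, b, d_out):
    # Illustrative gradient kernel: derives dA and dB by calling the
    # forward kernel twice, as the combination mode does.
    equation = equation.replace(" ", "")
    lhs, out = equation.split("->")
    in_a, in_b = lhs.split(",")
    d_a = einsum_kernel(f"{out},{in_b}->{in_a}", d_out, b)  # A swapped with dO
    d_b = einsum_kernel(f"{in_a},{out}->{in_b}", a, d_out)  # B swapped with dO
    return d_a, d_b

A, B = np.random.rand(3, 4), np.random.rand(4, 5)
O = einsum_kernel("ij,jk->ik", A, B)
dA, dB = einsum_grad_kernel("ij,jk->ik", A, B, np.ones_like(O))
```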
The Trace mode is simpler than the combination mode and is implemented as follows: the sequence of operators participating in the forward calculation is recorded, and the corresponding reverse operators are then called once each in the order opposite to the forward calculation. For example, if Transpose operator -> Matmul operator is called in the forward direction, then Matmul gradient operator -> Transpose gradient operator may be called in the reverse direction.
Compared with the combination mode, the Trace mode has the advantage of a fast reverse pass, but has the disadvantages of complex implementation and heavy dependence on the basic components provided by the framework. If the framework does not support the reverse Op (operation) or does not provide a basic Trace component, the implementation cost is significant, because the component that traces Ops in reverse is almost a core mechanism of a deep learning framework.
In view of this, the embodiments of the present disclosure improve the combination mode, so as to increase the speed of executing the Einstein operator in the combination mode and reduce kernel resource consumption. To achieve this goal, the embodiments of the present disclosure provide the technical idea that the combination mode can reuse the results of forward propagation, so that the operation speed of the Einstein operator in the combination mode is accelerated by reducing the number of kernel calls, and computing resources are saved.
Based on the technical concept, the data processing method can be applied to any scene needing to execute the Einstein operator, such as the fields of natural language processing, machine vision, protein structure prediction and the like.
As shown in fig. 1, which is a flowchart of a data processing method in an embodiment of the present disclosure, the method includes:
s101, acquiring a first tensor required by a first Einstein operator.
The first tensor is an operand of the first einstein operator, such as the a tensor and the B tensor in the previous example.
S102, invoking a kernel to execute a transposition operation on the first tensor to obtain a first tensor transposition result.
S103, storing the first tensor transposition result into a target storage medium.
The target storage medium may be, for example, a cache.
And S104, generating subscripts of the first tensor of the second Einstein operator according to a preset marking rule at the stage of planning and scheduling the first tensor of the second Einstein operator. The preset marking rule meets the requirement of multiplexing the first tensor transposition result.
As set forth above, any Einstein operator includes a Planner stage and an Executor stage. The subscripts of the Einstein operator need to be marked in the Planner stage first, and the operation is then executed according to the marking result in the Executor stage. Table 1 shows an example of labeling subscripts; it lists the label sets and the subscripts under each set, where there are 2 operands, A and B, with, e.g., the Equation: mik,mjk->mij. The Planner stage classifies and labels subscripts, and there are 4 common categories:
batch (hereinafter also referred to as the first mark set);
free (hereinafter also referred to as the second mark set or the third mark set);
contraction; and
reduction.
In the Einstein operator, A and B are the operands and O is the output result. Accordingly, batch denotes a subscript common to A, B, and O. free denotes a subscript present only in A and O, or only in B and O. contraction denotes a subscript present in both A and B but not in O. reduction denotes a subscript present only in A or only in B.
According to this classification rule, in the Equation mik,mjk->mij, m belongs to the batch class, i and j belong to the free class, and k belongs to the contraction class. For reduction, in einsum("i,j->j", A, B) for example, i appears in the input parameters but not in the output result, so i is a reduction subscript.
Set:       ABO     AO      BO      AB            A           B
Category:  batch   freeA   freeB   contraction   reduction   reduction
TABLE 1
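The following minimal Python sketch (the function name is illustrative) reproduces the classification of Table 1 for the Equation mik,mjk->mij used above:

```python
def classify_subscripts(equation):
    """Classify each index letter of a two-operand Equation into the
    categories of Table 1: batch (ABO), freeA (AO), freeB (BO),
    contraction (AB), reduction (only A or only B)."""
    equation = equation.replace(" ", "")
    inputs, out = equation.split("->")
    a, b = inputs.split(",")
    labels = {}
    for ch in dict.fromkeys(a + b):            # preserve first-seen order
        in_a, in_b, in_o = ch in a, ch in b, ch in out
        if in_a and in_b and in_o:
            labels[ch] = "batch"        # ABO
        elif in_a and in_o:
            labels[ch] = "freeA"        # AO
        elif in_b and in_o:
            labels[ch] = "freeB"        # BO
        elif in_a and in_b:
            labels[ch] = "contraction"  # AB
        else:
            labels[ch] = "reduction"    # A only or B only
    return labels

print(classify_subscripts("mik,mjk->mij"))
# {'m': 'batch', 'i': 'freeA', 'k': 'contraction', 'j': 'freeB'}
```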
In the embodiment of the disclosure, by planning the marking rule of the Planner stage, the second Einstein operator can multiplex the forward propagation results of the first Einstein operator. For example, during the calculation of the second Einstein operator, the kernel might otherwise be invoked to perform the same operations as during the calculation of the first Einstein operator: the operands of the second Einstein operator and of the first Einstein operator both include the first tensor, and the transpose of the first tensor is required in the calculation of both Einstein operators. Then, by presetting the marking rule, the first tensor in the second Einstein operator can multiplex the first tensor transposition result of the first Einstein operator. At this time, if the first tensor transposition result is needed in the calculation of the second Einstein operator, it can be read directly from the target storage medium to meet the calculation requirement of the second Einstein operator. In this way, the transposition result of the first tensor is reused directly instead of calling the Transpose kernel again to repeat the transposition operation, and a part of the intermediate calculation process of the second Einstein operator is omitted.
Thus, S105 can be performed: in the case where the second Einstein operator needs to multiplex the first tensor transposition result, the first tensor transposition result is read from the target storage medium.
And S106, executing the operation process of the second Einstein operator based on the read first tensor transposition result and the subscript of the first tensor of the second Einstein operator.
In the related art, each time the first tensor transposition result is desired, the kernel needs to be started and called to perform the transposition operation on the first tensor. However, in practical situations, the cost of invoking the kernel to perform the related operations is very high, and invoking the kernel multiple times may seriously affect the computational efficiency and increase the hardware burden. In the embodiment of the disclosure, the first tensor transposition result of the first Einstein operator is reused in the calculation process of the second Einstein operator, so that multiple kernel startups and calls are avoided, resource consumption is greatly reduced, and the operation efficiency is improved.
In some embodiments, the first einstein operator includes two operands, and the first tensor is either one of the two operands. That is, the first operand of the first einstein operator may be regarded as the first tensor, and the transposed result of the first operand thereof may be stored for the second einstein operator to multiplex. The second operand of the first einstein operator can also be regarded as the first tensor, and the transposed result of the second operand thereof is stored for the second einstein operator to multiplex. It is also possible to treat both the first operand and the second operand of the first einstein operator as the first tensor, respectively, and store the transposed result of each of the first operand and the second operand thereof for multiplexing by the second einstein operator.
In the embodiment of the present disclosure, any operand in the Einstein operator can be used as the first tensor, rather than a specific operand being fixed as the first tensor, and the target storage medium can be used reasonably as required. In addition, by multiplexing the first tensor transposition result, the number of kernel calls can be effectively reduced, and resource consumption is reduced. Moreover, in the model training stage, the first tensor is determined as needed, so that the data processing process is more flexible and efficient.
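A hedged sketch of S102 to S105 follows; the cache dictionary and helper name are illustrative, and in practice the cache would live inside the framework's Einsum implementation and be keyed so that the second Einstein operator can locate the first tensor transposition result:

```python
import numpy as np

transpose_cache = {}   # illustrative target storage medium (a cache)

def get_transposed(name, tensor, perm):
    """Return the transposed tensor, invoking the Transpose kernel only
    the first time; later calls multiplex the cached result."""
    key = (name, tuple(perm))
    if key not in transpose_cache:
        transpose_cache[key] = np.transpose(tensor, perm)  # kernel call
    return transpose_cache[key]                            # multiplexed read

A = np.random.rand(2, 3, 4, 5)                     # subscripts "ibnd"
TA_first = get_transposed("A", A, (1, 2, 0, 3))    # computed once: "bnid"
TA_second = get_transposed("A", A, (1, 2, 0, 3))   # read from the cache
assert TA_first is TA_second
```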
For ease of understanding, the generation of the preset marking rule in the embodiments of the present disclosure, and the specific contents of the rule will be described in detail below.
For example, in the reverse pass, the calculation may be performed using two forward calls. When O = paddle.einsum("ibnd, jbnd->bnij", A, B) is differentiated in reverse, dA = paddle.einsum("bnij, jbnd->ibnd", dO, B) may be used; that is, operand A is replaced by dO and operand A's subscripts are replaced by O's subscripts, which corresponds to interchanging operands A and O. If the transposed results are to be multiplexed, multiplexing can be realized when the subscripts TA1, TA2, and TA3 are identical across the three EinsumKernel calls. Taking the following three calls, denoted by expressions 1), 2), and 3), as examples, the reason why multiplexing is possible when the subscripts of TA1, TA2, and TA3 are completely the same is analyzed below:
O = paddle.einsum(A, B)    1)
dA = paddle.einsum(B, dO)    2)
dB = paddle.einsum(A, dO)    3)
where TA1, TA2, and TA3 denote the transposed subscripts of operand A in expressions 1), 2), and 3), respectively.
Because operand A does not participate as an input operand in expression 2), expression 2) does not need to multiplex the transpose of operand A. But operand A is an input operand in expression 3), so it is desirable that expression 3) be able to multiplex the transpose of operand A from expression 1). The subscript TA1 of operand A in 1) is: ABO, AO, AB. The subscript TA3 of operand A in 3) is: ABO, AB, AO. It can be seen that TA1 and TA3 are not consistent, so the transpose of operand A cannot be multiplexed. Therefore, the embodiments of the present disclosure modify the marking rules so that expression 3) can multiplex the transpose of operand A from expression 1).
In view of this, in this embodiment of the present disclosure, at the stage of planning and scheduling the first tensor of the second einstein operator, the subscript of the first tensor of the second einstein operator is generated according to a preset labeling rule, where the preset labeling rule includes:
(1) The same elements in the same mark set of the first Einstein operator and the second Einstein operator follow the same ordering.
Wherein, the mark sets are shown in Table 1, namely sets ABO, AO, BO, AB, A and B.
Suppose operand A contains the elements "ibnd", operand B contains the elements "jbnd", and O contains the elements "bnij". The elements of the ABO set must then appear in the same order, "bn", in both expression 1) and expression 3); the "bn" in the ABO set of expression 1) must not become "nb" in the ABO set of expression 3). Thus, the embodiments of the present disclosure require that the same elements in the same mark set be arranged in the same order.
(2) The transposed subscripts of different operand positions are not constructed identically, i.e.:
in case the first tensor is the first operand of the second einstein operator, the transposed subscript of the first operand satisfies the order ABO, AO, AB. That is, the transpose of the first operand can be obtained by sequentially splicing the elements in the ABO, AO and AB sets. For example, if the operand a is ibnd, the element included in the ABO set is bn, the element included in the AO set is i, and the element included in the AB set is d, the elements in the ABO, AO, and AB sets are sequentially spliced to obtain the transpose bnid of the operand a.
In the case where the first tensor is the second operand of the second Einstein operator, the transposed subscript of the second operand satisfies the order ABO, AB, BO. That is, when the operand is the second operand, its transpose can be obtained by sequentially splicing the elements in the sets ABO, AB, and BO. For example, if operand B is jbnd, the element included in the ABO set is bn, the element included in the AB set is d, and the element included in the BO set is j, so splicing the elements of the ABO, AB, and BO sets in order gives the transpose bndj of operand B.
Where the ABO, AO and BO sets have the same meaning as previously indicated, e.g. ABO is the first set of labels, and the elements in ABO are contained in both operands of the second einstein operator and in the output result of the second einstein operator.
AO is the second mark set, the element in AO is contained in the first operand of the second Einstein operator and contained in the output result of the second Einstein operator.
BO is a third set of labels, the elements in BO being contained in the second operand of the second einstein operator and in the output result of the second einstein operator.
In the embodiment of the disclosure, by formulating the preset marking rule, the second Einstein operator meets the condition for multiplexing the first tensor transposition result of the first Einstein operator, so that the multiplexing of the first tensor transposition result is ensured and a condition is provided for simplifying the operation process of the second Einstein operator. The transposed result is cached to a storage medium for multiplexing, and the number of calls to the Transpose kernel is thereby reduced.
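A minimal Python sketch of rules (1) and (2) follows (the helper names are illustrative): the transposed subscript of a first operand is spliced as ABO|AO|AB, that of a second operand as ABO|AB|BO, with a fixed element order inside each set:

```python
def label_sets(a, b, out):
    """Split index letters into the ABO/AO/BO/AB sets of Table 1,
    keeping a fixed (first-seen) order inside each set."""
    sets = {"ABO": "", "AO": "", "BO": "", "AB": ""}
    for ch in dict.fromkeys(a + b):
        in_a, in_b, in_o = ch in a, ch in b, ch in out
        if in_a and in_b and in_o:
            sets["ABO"] += ch
        elif in_a and in_o:
            sets["AO"] += ch
        elif in_b and in_o:
            sets["BO"] += ch
        elif in_a and in_b:
            sets["AB"] += ch
    return sets

def transposed_subscript(a, b, out, operand_position):
    s = label_sets(a, b, out)
    if operand_position == "first":     # rule: ABO | AO | AB
        return s["ABO"] + s["AO"] + s["AB"]
    else:                               # rule: ABO | AB | BO
        return s["ABO"] + s["AB"] + s["BO"]

# Equation ibnd, jbnd -> bnij from the example above
print(transposed_subscript("ibnd", "jbnd", "bnij", "first"))   # bnid (TA)
print(transposed_subscript("ibnd", "jbnd", "bnij", "second"))  # bndj (TB)
```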
In practice, all three transposes involved in the calculation may be stored to the target storage medium. For example, the transpose TA of operand A, the transpose TB of operand B, and the transpose TdO of the gradient dO of the output result O of expression 1) are all stored for multiplexing.
In some embodiments, in the case where the first tensor is the first operand of the first Einstein operator and the second Einstein operator is used to determine the gradient of the second operand of the first Einstein operator, in order to multiplex as many of the transposes of the forward calculation as possible, the expression of the second Einstein operator is determined in the embodiments of the present disclosure as a first target expression, as shown in expression 4):
dO×A->dB 4)
In expression 4), dO represents the gradient of the output result of the first Einstein operator, A represents the first tensor, and dB represents the gradient of the second operand of the first Einstein operator.
And taking the read first tensor transposition result as the transposition of A in the first target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as the transposition of dO in the first target expression, and executing the first target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the second operand of the first Einstein operator.
Thus, in the disclosed embodiments, in order to multiplex the transposes of the forward stage as much as possible, the first target expression is designed to solve the gradient of the second operand of the first Einstein operator. The number of kernel calls is reduced by increasing the number of multiplexed transpositions, thereby saving computing resources.
Similarly, in the case where the first tensor is the second operand of the first Einstein operator and the second Einstein operator is used to determine the gradient of the first operand of the first Einstein operator, the expression of the second Einstein operator is determined as a second target expression, which is the same as expression 2), that is: B×dO->dA, where dO represents the gradient of the output result of the first Einstein operator, B represents the first tensor, and dA represents the gradient of the first operand of the first Einstein operator.
And taking the read first tensor transposition result as the transposition of B in a second target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as the transposition of dO in the second target expression, and executing the second target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the first operand of the first Einstein operator.
Thus, in the disclosed embodiments, in order to multiplex as many transposes of the forward stage as possible, the second target expression is employed to solve the gradient of the first operand of the first Einstein operator. The number of kernel calls is reduced by increasing the number of multiplexed transpositions, thereby saving computing resources.
In summary, in order to multiplex the three transpositions in expression 1) as much as possible, in the embodiment of the present disclosure, expression 3) in the original forward calculation process is modified to expression 4), so that the three transpositions in expression 1) can be multiplexed in both expression 2) and expression 4) in the reverse stage.
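To make the reuse concrete, here is a hedged numpy sketch of the whole flow for the Equation ibnd,jbnd->bnij; numpy stands in for the framework kernels, and the permutations and index orders bnid, bndj, bnji below are specific to this Equation and follow the marking rule above:

```python
import numpy as np

i, b, n, d, j = 2, 3, 4, 5, 6
A = np.random.rand(i, b, n, d)                   # operand A, subscripts "ibnd"
B = np.random.rand(j, b, n, d)                   # operand B, subscripts "jbnd"
dO = np.random.rand(b, n, i, j)                  # gradient of the output, "bnij"

# The three transposes are computed once and kept in the target storage medium.
TA = np.transpose(A, (1, 2, 0, 3))               # "ibnd" -> "bnid"  (ABO|AO|AB)
TB = np.transpose(B, (1, 2, 3, 0))               # "jbnd" -> "bndj"  (ABO|AB|BO)
TdO = np.transpose(dO, (0, 1, 3, 2))             # "bnij" -> "bnji"  (shared by 2) and 4))

# Forward: expression 1) as one batched MatMul over the transposed operands.
O = np.einsum("bnid,bndj->bnij", TA, TB)
assert np.allclose(O, np.einsum("ibnd,jbnd->bnij", A, B))

# Reverse: expressions 2) and 4) multiplex TA, TB and TdO instead of
# invoking the Transpose kernel again.
dA_mid = np.einsum("bnji,bndj->bnid", TdO, TB)   # B x dO -> dA, index "bnid"
dB_mid = np.einsum("bnji,bnid->bndj", TdO, TA)   # dO x A -> dB, index "bndj"

# The intermediate gradients keep the transposed index order; they are
# transposed back to the operand layout as described in the next section.
assert np.allclose(np.transpose(dA_mid, (2, 0, 1, 3)),
                   np.einsum("bnij,jbnd->ibnd", dO, B))
assert np.allclose(np.transpose(dB_mid, (3, 0, 1, 2)),
                   np.einsum("bnij,ibnd->jbnd", dO, A))
```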
In some embodiments, the finally determined gradient of an operand may not be directly suitable for subsequent calculations, due to the transpose operations and the adoption of the preset marking rules. To this end, in the embodiment of the present disclosure, the obtained gradient may be converted, through a transposition operation, into a target gradient available for subsequent calculation. As shown in fig. 2, this can be implemented as:
s201, under the condition that the second Einstein operator carries out reverse operation relative to the first Einstein operator, the intermediate gradient is stored in a target storage medium.
S202, reading the intermediate gradient from the target storage medium, and calling the kernel to transpose the intermediate gradient according to the subscript of the corresponding target operand to obtain the target gradient of the target operand.
Wherein:
under the condition that the target operand is the first operand of the first Einstein operator, the gradient of the first operand of the first Einstein operator determined by the second Einstein operator is an intermediate gradient;
in the case where the target operand is the second operand of the first einstein operator, the gradient of the second operand of the first einstein operator determined by the second einstein operator is an intermediate gradient.
For example, the index of the intermediate gradient dB is calculated as bndj by multiplexing TdO and TA, and based on expression 1), it is known that the index of the required target gradient dB is jbnd, and then the index of the intermediate gradient needs to be converted into the index order of the target gradient through a transposing operation, that is, bndj is transposed into jbnd.
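A minimal sketch of this final transposition (the helper name is illustrative):

```python
import numpy as np

def to_target_layout(intermediate, src_subscript, dst_subscript):
    """Transpose an intermediate gradient (e.g. dB with index 'bndj')
    into the target operand layout (e.g. 'jbnd')."""
    perm = [src_subscript.index(ch) for ch in dst_subscript]
    return np.transpose(intermediate, perm)

dB_mid = np.random.rand(3, 4, 5, 6)              # intermediate index "bndj"
dB = to_target_layout(dB_mid, "bndj", "jbnd")    # target index "jbnd"
print(dB.shape)                                  # (6, 3, 4, 5)
```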
In some embodiments, the target storage medium has stored therein at least the gradient of the first operand of the first einstein operator as determined by the second einstein operator and the gradient of the second operand of the first einstein operator as determined by the second einstein operator. When transposing intermediate gradients, the gradient of the first operand may be read from the target storage medium, or the gradient of the second operand may be read, or both the gradient of the first operand and the gradient of the second operand may be read.
In the embodiment of the disclosure, an intermediate gradient obtained by performing inverse operation on the second einstein operator with respect to the first einstein operator is stored in a target storage medium for subsequent multiplexing of the intermediate gradient and transposing the intermediate gradient into a target gradient, so as to improve the accuracy of calculation.
It should be noted that even without the multiplexing scheme of the embodiment of the present disclosure, the obtained dB and dA in the related art also need to be transposed; therefore, in terms of the overall flow, the transposition of the intermediate gradients in the embodiment of the present disclosure does not increase the number of kernel calls.
In the disclosed embodiments, in addition to the case where the Einstein operator has two operands, the case where the Einstein operator has multiple operands is also considered. When the Einstein operator has multiple operands, the following method can be extended to implement the multiplexing of the transposed results:
in the case where the target einstein operator comprises n operands in an ordered arrangement, the target einstein operator is disassembled into a plurality of first einstein operators that are executed in order based on the following method:
and determining two operands which are sequenced into a first position and a second position in the n operands as a first operand of the first Einstein operator corresponding to the operand of the second position, and obtaining an output result of the first Einstein operator corresponding to the operand of the second position.
In the case where there is an unprocessed operand among the n operands, the operand ordered first among the unprocessed operands is determined to be the target operand. And the output result of the first Einstein operator corresponding to the last operand of the target operand and the target operand are respectively used as the first operand and the second operand of the first Einstein operator corresponding to the target operand, so that the output result of the first Einstein operator corresponding to the target operand is obtained.
In some embodiments, for example, the target einstein operator has three operands, D, E, and F, and the two operands D and E ordered as the first position and the second position are used as the two operands of the first einstein operator corresponding to the operand E at the second position, and the output result of the first einstein operator corresponding to the operand E at the second position can be obtained.
In this case, F is an unprocessed operand among the three operands D, E, and F. Since only the operand F is left, F is the target operand among the unprocessed operands, and the operand preceding the target operand F is E. The output result of the first Einstein operator corresponding to E and the target operand F are then taken as the first operand and the second operand of the first Einstein operator corresponding to the target operand F, respectively, and the two-operand processing described above is applied to them to obtain the output result of the first Einstein operator corresponding to the target operand F. The loop ends when no unprocessed operands remain.
In the embodiment of the disclosure, when the Einstein operator has a plurality of operands, the complicated Einstein operator can be decomposed into Einstein operators with only two operands, and the method for multiplexing transposed results in the embodiment of the disclosure is then executed, so that the number of kernel calls is reduced, computing resources are saved, the multiplexing of the output results of the Einstein operators is realized, and the operational efficiency of the Einstein operator is improved.
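A hedged sketch of this pairwise decomposition follows; the list of pairwise Equations is supplied explicitly here for illustration, whereas a real implementation would derive each pairwise Equation from the full n-operand Equation:

```python
import numpy as np

def einsum_pairwise(equations, operands):
    """Illustrative decomposition of an n-operand Einstein operator into
    two-operand Einstein operators executed in order: the operands at the
    first and second positions are combined first, and each remaining
    operand is then combined with the previous output."""
    result = np.einsum(equations[0], operands[0], operands[1])
    for eq, op in zip(equations[1:], operands[2:]):
        result = np.einsum(eq, result, op)   # previous output + next operand
    return result

# Three operands D, E, F, e.g. a chained matrix product
D, E, F = np.random.rand(2, 3), np.random.rand(3, 4), np.random.rand(4, 5)
out = einsum_pairwise(["ij,jk->ik", "ik,kl->il"], [D, E, F])
assert np.allclose(out, D @ E @ F)
```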
For facilitating understanding of the embodiment of the present disclosure, an overall flow of the data processing method of the embodiment of the present disclosure is described with reference to fig. 3 a:
as shown in fig. 3a, a and B are input as two operands, the output C obtained after the first calculation of a and B is derived and dC is also used as the input number. And transposing the A, the B and the dC for the first time to obtain TA, TB and TdC in the middle layer, wherein the TA, TB and TdC can multiplex 3 caches. Here 3 danpos can be reduced. In the reverse direction, it is not necessary to repeat the calculation of TA, TB and TC. And continuously performing BMM (BatchedMatMul, matrix multiplication batch processing) on the TA, the TB and the TdC to obtain intermediate output results C, dB and dA. The intermediate output result cannot be directly output for use, but the inversion operation is needed again, and the inverted C, dB and dA are obtained as the final available output result.
For a specific calculation process, take fig. 3b as an example (note that the table header of fig. 3b represents the mark sets, which are different from the specific operands):
if the forward calculation is: einum ("ibd, jbnd- > bnij", a, B), since a is located in the first operand, when the forward subscripts of the two operands are required to be inconsistent, the transpose of the first operand is: ABO set, AO set, AB set; the transpose of the second operand is: ABO set, AB set, BO set. ABO | AO | AB- > bnid is obtained, so the subscript of TA1 is: bnid.
The reverse calculation is: einsum("bnij, ibnd->jbnd", dO, A). Since operand A is now the second operand, the transpose of the second operand is: ABO set, AB set, BO set, and the A, B and O of the table header now correspond to: A->dO, B->A, O->dB, respectively. The transpose of operand A can thus be obtained as: ABO | AB | BO -> OAB | OA | AB -> bnid, so TA2 = bnid.
In summary, in combination with the forward calculation and the backward calculation processes, the subscripts of the corresponding operands are adjusted in the backward calculation according to the labeling rule, so that TA1= TA2 finally, and the multiplexing condition is satisfied.
After this optimization method is used, the TA, TB and TdC multiplexed in the reverse pass are guaranteed to be correct. The reverse speed of the Einstein operator can thus be brought to almost the same level as the Trace mode. Experiments show that the optimized reverse pass takes 35 ms, whereas the original combination mode without multiplexing the transposed results takes 40 ms, an improvement of 16 percent.
In summary, experiments show that the reverse calculation speed can be greatly increased by optimizing the Einstein operator back propagation process in the embodiment of the present disclosure while consuming the same amount of video memory. Meanwhile, in the theoretically optimal case, 6 kernel operations, namely 3 Transpose operations and 3 ReduceSum operations, can be saved in the back propagation process of the Einstein operator.
Experiments show that the number of transposes can be reduced to a very low level. With A and B as the 2 inputs and O as the 1 output, dO is the gradient of the output. Since the three matrices O, dA and dB need to be obtained, the kernel count, without considering broadcast or reduction and in the extreme (theoretical worst) case, is as follows: the theoretical minimum in the forward direction is 1 Matmul + 2 Transpose (TA and TB obtained for A and B); in the reverse direction there are 2 Matmul operations and one Transpose of dO; finally, the 3 Transposes of the outputs in FIG. 3a are added. So the total is 3 input Transposes + 3 output Transposes + 3 Matmuls. It should be noted that this value is theoretical and is the worst case; since the total number of kernel calls of the best planning algorithm in the related art is at best this number, the solution of the embodiment of the present disclosure still retains its advantage. In addition, there are other situations, for example when no Transpose is required for an output or input, in which the number of kernel calls can be further reduced by adopting the method of the embodiment of the disclosure. In summary, in the disclosed embodiments, each variable theoretically needs to be transposed only once, whereas in the related art the same variable may be transposed multiple times. If Reduction and Broadcast are also involved, the results of Reduction + Transpose and of Broadcast + Transpose can likewise be cached, so reducing the number of kernel calls through multiplexing is also applicable when Reduction and Broadcast are combined.
The data processing method according to the embodiment of the present disclosure will be described with reference to the field of natural language processing as an example. As shown in fig. 4:
FIG. 4 shows the XLNet (Generalized Autoregressive Pretraining, an extension of Transformer-XL) model, with the Einstein operator applied in the Relative Attention of XLNet, specifically in the two Masked Two-stream Attention modules in fig. 4. The Einstein operator combines a plurality of complex linear operations on the embedding tensor of the text input and finally outputs the attention weights (output attentions).
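As a hedged illustration of how such attention weights can be expressed with a single Einstein operator (the shapes and the output layout follow the Equation ibnd,jbnd->bnij used earlier in this disclosure rather than any particular XLNet implementation):

```python
import numpy as np

seq_len, batch, heads, head_dim = 8, 2, 4, 16
q = np.random.rand(seq_len, batch, heads, head_dim)   # query embedding, "ibnd"
k = np.random.rand(seq_len, batch, heads, head_dim)   # key embedding,   "jbnd"

# One Einstein operator expresses the transpose + batched matmul that
# produces the attention scores between every pair of positions.
attn_score = np.einsum("ibnd,jbnd->bnij", q, k)        # (batch, heads, seq, seq)
```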
The data processing method according to the embodiment of the present disclosure will be described with reference to the field of protein structure prediction as an example. For example, in the gray area of fig. 5, the model includes: PLM (Protein Language Model), adapter and geometric Model. Based on this protein structure prediction model, predicting the structure of a candidate protein can be implemented as shown in the gray area of fig. 5:
constructing primary structure information and an attention map of the candidate protein based on a protein language model in the protein structure prediction model. Here, the disclosed embodiments employ 300 million single sequences (the ~300M primary sequences in the gray area of fig. 5) to train the PLM model, so that the PLM can accurately extract the primary structure information and the attention map.
Inputting the primary structure information of the candidate protein and the attention map into an adapter layer (adapter) of the protein structure prediction model to obtain the secondary structure information of the candidate protein. As shown in the gray area of fig. 5, the dashed box behind the adapter layer shows the secondary structure information, which includes single-sequence representation 1 (single repr.) and pair representation 1 (pair repr.). The adapter layer may comprise two linear layers: inputting the primary structure information into one of the linear layers yields single-sequence representation 1 of the secondary structure information, and inputting the attention map into the other linear layer yields pair representation 1 of the secondary structure information.
And inputting the secondary structure information of the candidate protein into the geometric model to obtain the tertiary structure information of the candidate protein. As shown in the gray area of fig. 5, the geometric model may be the Geometric Modeling module of the AlphaFold model, so that the structure of the candidate protein can be accurately predicted by using the structure prediction capability of the geometric model.
It should be noted that the original Evoformer module in the AlphaFold model uses a searched MSA (Multiple Sequence Alignment) as an input. In contrast, the output of the adapter layer is adopted as the MSA in the embodiment of the disclosure, so that the MSA search process is omitted and the prediction speed is increased. Second, the Evoformer in the embodiments of the present disclosure employs various attention mechanisms to exchange information between the single-sequence representation and the paired representation to learn spatial relationships.
In the embodiment of the disclosure, the Structure Module adopts the single-sequence representation and the paired representation generated by the Evoformer, and uses Invariant Point Attention and other geometric transformation operators to realize end-to-end prediction of the 3D coordinates of the atoms in the docking structure.
The disclosed embodiments train the PLM model with 300 million single sequences (the ~300M primary sequences in the gray area of fig. 5). Because structure prediction by means of the PLM alone is not enough to capture the required feature information, the PLMBase (PLM) module and the Geometric Modeling module in the protein structure prediction model (HelixFold-Single) are jointly optimized. Optimization is performed using about one hundred thousand experimentally determined protein structures (the ~120K determined structures in the gray area of FIG. 5), and training is performed with approximately one million additional estimated protein structures (the ~1M estimated structures in the gray area of fig. 5). The network is trained end to end using the main losses, including the Frame Aligned Point Error (FAPE) loss and other auxiliary losses. HelixFold-Single is able to provide efficient and accurate protein structure prediction by combining a computationally efficient PLMBase module (compared with MSA search) with the Geometric Modeling module.
In the protein structure prediction model shown in the gray region of fig. 5, the Einstein operator is used: the protein sequence is the input, i.e., an operand of the Einstein operator, and the attention weights (i.e., the attention map in the gray region of fig. 5) are output by combining a plurality of einsum operators.
In the field of protein structure prediction, the method of the embodiment of the disclosure reuses the transposition of a forward-propagated protein sequence, so that the number of times of starting and calling a kernel can be reduced, computing resources can be saved, and the efficiency of model pre-training can be improved.
The data processing method of the embodiment of the present disclosure is described by taking the field of machine vision as an example. For example, in automatic driving control, an image of the surrounding environment of the vehicle is acquired by an image acquisition device of the vehicle, and the features of the surrounding environment are extracted from the image by a feature extraction module in the visual model.
In the field of machine vision, the method of the embodiment of the disclosure reuses the transposes of the forward-propagated image features, so that the number of kernel calls can be reduced, computing resources are saved, and the efficiency of model pre-training is improved.
It should be noted that multiplexing the cached transposition results is applicable not only to the field of machine vision, but also to other fields in which the gradients of model parameters are solved through back propagation, such as the natural language processing field and the protein structure prediction field.
Based on the same technical concept, an embodiment of the present disclosure further provides a data processing apparatus, as shown in fig. 6, including:
the first acquisition module is used for acquiring a first tensor required by a first Einstein operator;
the calling module is used for calling the kernel to execute a transposition operation on the first tensor to obtain a first tensor transposition result;
the first storage module is used for storing the first tensor transposition result into a target storage medium;
the generation module is used for generating subscripts of the first tensor of the second Einstein operator according to a preset marking rule at the stage of planning and scheduling the first tensor of the second Einstein operator; the preset marking rule meets the requirement of multiplexing the first tensor transposition result;
the reading module is used for reading the first tensor transposition result from the target storage medium under the condition that the second Einstein operator needs to multiplex the first tensor transposition result;
and the execution module is used for executing the operation process of the second Einstein operator based on the read first tensor transposition result and the subscript of the first tensor of the second Einstein operator.
In some embodiments, the preset marking rules include:
the ordering sequence of the same elements in the same mark set of the first Einstein operator and the second Einstein operator is the same;
under the condition that the first tensor is a first operand of the second einstein operator, the transposed subscript of the first operand meets the sequence of ABO, AO and AB;
in the case where the first tensor is the second operand of the second einstein operator, the transposed subscript of the second operand satisfies the order of ABO, AB, BO:
wherein ABO is a first set of labels, and elements in ABO are contained in two operands of a second Einstein operator and in an output result of the second Einstein operator;
AO is the second mark set, the element in AO is contained in the first operand of the second Einstein operator and contained in the output result of the second Einstein operator;
BO is a third set of labels, the elements in BO being contained in the second operand of the second einstein operator and in the output result of the second einstein operator.
In some embodiments, an execution module to:
in the case where the first tensor is the first operand of the first Einstein operator and the second Einstein operator is used to determine the gradient of the second operand of the first Einstein operator, determining the expression of the second Einstein operator as a first target expression, the first target expression being: dO×A->dB, where dO represents the gradient of the output result of the first Einstein operator, A represents the first tensor, and dB represents the gradient of the second operand of the first Einstein operator;
and taking the read first tensor transposition result as the transposition of A in the first target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as the transposition of dO in the first target expression, and executing the first target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the second operand of the first Einstein operator.
In some embodiments, the execution module is further to:
in the case where the first tensor is the second operand of the first Einstein operator and the second Einstein operator is used to determine the gradient of the first operand of the first Einstein operator, determining the expression of the second Einstein operator as a second target expression, the second target expression being: B×dO->dA, where dO represents the gradient of the output result of the first Einstein operator, B represents the first tensor, and dA represents the gradient of the first operand of the first Einstein operator;
and taking the read first tensor transposition result as the transposition of B in a second target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as the transposition of dO in the second target expression, and executing the second target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the first operand of the first Einstein operator.
In some embodiments, the data processing apparatus further comprises:
the second storage module is used for storing the intermediate gradient into the target storage medium under the condition that the second Einstein operator carries out inverse operation relative to the first Einstein operator;
the second obtaining module is used for reading the intermediate gradient from the target storage medium and calling the kernel to transpose the subscript of the target gradient according to the subscript of the corresponding target operand to obtain the target gradient of the target operand;
wherein:
under the condition that the target operand is the first operand of the first Einstein operator, the gradient of the first operand of the first Einstein operator determined by the second Einstein operator is an intermediate gradient;
in the case where the target operand is the second operand of the first Einstein operator, the gradient of the second operand of the first Einstein operator determined by the second Einstein operator is an intermediate gradient.
In some embodiments, the first einstein operator has two operands, further comprising a splitting module to:
in the case where the target einstein operator comprises an ordered arrangement of n operands, the target einstein operator is broken down into a plurality of first einstein operators that execute in order based on the following method:
determining the two operands ranked at the first position and the second position among the n operands as the two operands of the first Einstein operator corresponding to the operand at the second position, and obtaining an output result of the first Einstein operator corresponding to the operand at the second position;
determining the first-ordered operand among the unprocessed operands as a target operand in the case where an unprocessed operand exists among the n operands; and,
taking the output result of the first Einstein operator corresponding to the operand preceding the target operand and the target operand as the first operand and the second operand of the first Einstein operator corresponding to the target operand, respectively, to obtain the output result of the first Einstein operator corresponding to the target operand.
In some embodiments, the first einstein operator includes two operands, and the first tensor is either one of the two operands.
For a description of specific functions and examples of each module and sub-module of the apparatus in the embodiment of the present disclosure, reference may be made to the description of corresponding steps in the foregoing method embodiments, and details are not repeated here.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701 which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 executes the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed on the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A method of data processing, comprising:
acquiring a first tensor required by a first Einstein operator;
calling a kernel to perform a transposition operation on the first tensor to obtain a first tensor transposition result;
storing the first tensor transposition result into a target storage medium;
generating subscripts of the first tensor of a second einstein operator according to a preset marking rule at the stage of planning and scheduling the first tensor of the second einstein operator; the preset marking rule meets the requirement of multiplexing the first tensor transposition result;
reading the first tensor transposition result from the target storage medium under the condition that the second Einstein operator needs to multiplex the first tensor transposition result;
executing an operation process of the second einstein operator based on the read first tensor transposition result and the subscript of the first tensor of the second einstein operator.
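As an illustrative sketch of the reuse described in claim 1, the following Python snippet models the target storage medium as an in-memory dictionary and the transpose kernel as numpy.transpose; the class name TransposeCache and its keying scheme are assumptions for illustration only.

```python
import numpy as np

class TransposeCache:
    """Toy 'target storage medium' that memoizes transpose results so a
    second Einstein operator can reuse them instead of re-invoking the
    transpose kernel."""

    def __init__(self):
        self._store = {}

    def transpose(self, name, tensor, perm):
        key = (name, perm)
        if key not in self._store:          # first Einstein operator: call the kernel
            self._store[key] = np.transpose(tensor, perm)
        return self._store[key]             # later operators: reuse the stored result


# The forward operator transposes A once and stores the result.
cache = TransposeCache()
A = np.random.rand(2, 3, 4)
A_t = cache.transpose('A', A, (2, 0, 1))    # kernel call

# The backward operator asks for the same layout and hits the cache.
A_t_again = cache.transpose('A', A, (2, 0, 1))
assert A_t_again is A_t                     # no second kernel launch
```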
2. The method of claim 1, wherein the preset marking rule comprises:
elements belonging to the same mark set have the same ordering in the first Einstein operator and the second Einstein operator;
in the case that the first tensor is a first operand of the second einstein operator, the transposed subscript of the first operand satisfies the order of ABO, AO, AB;
in the case where the first tensor is the second operand of the second einstein operator, the transposed subscript of the second operand satisfies the order ABO, AB, BO:
wherein ABO is a first mark set, elements in ABO being included in both operands of the second Einstein operator and in the output result of the second Einstein operator;
AO is a second mark set, elements in AO being included in the first operand of the second Einstein operator and in the output result of the second Einstein operator;
BO is a third mark set, elements in BO being included in the second operand of the second Einstein operator and in the output result of the second Einstein operator.
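The mark sets of claim 2 can be sketched, under the assumption of a plain two-operand einsum equation string, as follows; the function name mark_sets is hypothetical, and an AB set of contracted labels is also returned because the subscript orders above refer to it.

```python
def mark_sets(eq):
    """Split the labels of a two-operand einsum equation 'ab,bc->ac' into:
      ABO: in both operands and in the output
      AO : only in the first operand and in the output
      BO : only in the second operand and in the output
      AB : in both operands but not in the output (contracted labels)
    Order within each set follows the first operand (second operand for BO),
    so both Einstein operators order the same elements identically."""
    inputs, out = eq.split('->')
    a, b = inputs.split(',')
    abo = [l for l in a if l in b and l in out]
    ao  = [l for l in a if l not in b and l in out]
    bo  = [l for l in b if l not in a and l in out]
    ab  = [l for l in a if l in b and l not in out]
    return abo, ao, bo, ab


# Example: batched matrix multiplication 'bij,bjk->bik' gives
# ABO = ['b'], AO = ['i'], BO = ['k'], AB = ['j']; the first operand is
# transposed to ABO+AO+AB = 'bij' and the second to ABO+AB+BO = 'bjk'.
print(mark_sets('bij,bjk->bik'))
```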
3. The method of claim 1 or 2, the transpose of the gradient of the output result of the first einstein operator also being stored into the target storage medium, wherein the performing of the operation of the second einstein operator based on the read first tensor transposed result and the subscript of the first tensor of the second einstein operator comprises:
determining an expression of the second einstein operator as a first target expression in the case that the first tensor is a first operand of the first einstein operator and the second einstein operator is used for determining a gradient of a second operand of the first einstein operator, the first target expression being: dO × A -> dB, wherein dO represents the gradient of the output result of the first Einstein operator, A represents the first tensor, and dB represents the gradient of the second operand of the first Einstein operator;
taking the read first tensor transposition result as the transposition of A in the first target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as the transposition of dO in the first target expression, and executing the first target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the second operand of the first Einstein operator.
4. The method of claim 1 or 2, the transpose of the gradient of the output result of the first einstein operator also being stored into the target storage medium, wherein the operation process of the second einstein operator is performed based on the read first tensor transposed result and the subscript of the first tensor of the second einstein operator, comprising:
in the case where the first tensor is the second operand of the first einstein operator and the second einstein operator is used to determine the gradient of the first operand of the first einstein operator, determining the expression of the second einstein operator as a second target expression, the second target expression being: B × dO -> dA, where dO represents the gradient of the output result of the first einstein operator, B represents the first tensor, and dA represents the gradient of the first operand of the first einstein operator;
taking the read first tensor transposed result as the transpose of B in the second target expression, taking the transpose of the gradient of the output result of the first Einstein operator read from the target storage medium as the transpose of dO in the second target expression, and executing the second target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the first operand of the first Einstein operator.
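A numerical sketch of the two target expressions in claims 3 and 4, assuming the simple matrix case 'ij,jk->ik' so that the cached transposes A_T and B_T can be contracted directly with dO; the label choices and variable names are illustrative assumptions, and the results are checked against the standard matrix-multiplication gradients.

```python
import numpy as np

# Forward first Einstein operator: O = einsum('ij,jk->ik', A, B)
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
dO = np.random.rand(3, 5)                  # gradient of the output result O

# Transposes that would already sit in the target storage medium.
A_T = np.ascontiguousarray(A.T)
B_T = np.ascontiguousarray(B.T)

# Claim 3, first target expression dO × A -> dB: reuse the cached A_T.
dB = np.einsum('ij,jk->ik', A_T, dO)       # (4,3) x (3,5) -> (4,5)

# Claim 4, second target expression B × dO -> dA: reuse the cached B_T.
dA = np.einsum('ij,jk->ik', dO, B_T)       # (3,5) x (5,4) -> (3,4)

# Sanity check against the standard matrix-multiplication gradients.
assert np.allclose(dB, A.T @ dO)
assert np.allclose(dA, dO @ B.T)
```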
5. The method of any of claims 1-4, further comprising:
storing an intermediate gradient into the target storage medium in the case where the second Einstein operator performs an inverse operation with respect to the first Einstein operator;
reading the intermediate gradient from the target storage medium, and calling a kernel to perform a transposition operation on the intermediate gradient according to the subscript of the corresponding target operand, to obtain the target gradient of the target operand;
wherein:
in the case that the target operand is a first operand of a first einstein operator, the gradient of the first operand of the first einstein operator determined by the second einstein operator is the intermediate gradient;
in the case where the target operand is a second operand of a first einstein operator, the gradient of the second operand of the first einstein operator determined by the second einstein operator is the intermediate gradient.
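A brief sketch of claim 5, assuming the intermediate gradient is produced in an internal subscript order and must be transposed once, according to the subscript of the target operand, to become the target gradient; the label strings and the helper restore_gradient are assumptions for illustration.

```python
import numpy as np

def restore_gradient(intermediate, inter_labels, target_labels):
    """Transpose an intermediate gradient, laid out with `inter_labels`,
    back to the subscript order `target_labels` of the target operand."""
    perm = tuple(inter_labels.index(l) for l in target_labels)
    return np.transpose(intermediate, perm)   # one kernel call on the read-back gradient


# The backward operator produced dB in the internal order 'bkj', while the
# second operand B of the forward operator was laid out as 'bjk'.
dB_internal = np.random.rand(2, 5, 4)          # labels 'bkj'
dB = restore_gradient(dB_internal, 'bkj', 'bjk')
assert dB.shape == (2, 4, 5)
```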
6. The method of any of claims 1-5, wherein the first Einstein operator has two operands, further comprising:
in the case where a target einstein operator comprises n operands in an ordered arrangement, the target einstein operator is disassembled into a plurality of first einstein operators that are executed in order based on the following method:
determining the operands ranked first and second among the n operands as the first operand and the second operand, respectively, of the first Einstein operator corresponding to the second-ranked operand, and obtaining an output result of the first Einstein operator corresponding to the second-ranked operand;
in the case where unprocessed operands exist among the n operands, determining the first-ranked unprocessed operand as a target operand; and,
taking the output result of the first Einstein operator corresponding to the operand preceding the target operand, and the target operand itself, as the first operand and the second operand, respectively, of the first Einstein operator corresponding to the target operand, to obtain an output result of the first Einstein operator corresponding to the target operand.
7. The method of any of claims 1-6, wherein the first Einstein operator includes two operands, the first tensor being either of the two operands.
8. A data processing apparatus comprising:
the first acquisition module is used for acquiring a first tensor required by a first Einstein operator;
the calling module is used for calling a kernel to execute a transposition operation on the first tensor to obtain a first tensor transposition result;
the first storage module is used for storing the first tensor transposition result into a target storage medium;
a generating module, configured to generate subscripts of the first tensor of a second einstein operator according to a preset marking rule at a stage of planning and scheduling the first tensor of the second einstein operator; the preset marking rule meets the requirement of multiplexing the first tensor transposition result;
a reading module, configured to read the first tensor transposition result from the target storage medium when the second Einstein operator needs to multiplex the first tensor transposition result;
an executing module, configured to execute an operation process of the second einstein operator based on the read first tensor transposed result and a subscript of the first tensor of the second einstein operator.
9. The apparatus of claim 8, wherein the preset marking rule comprises:
elements belonging to the same mark set have the same ordering in the first Einstein operator and the second Einstein operator;
in the case that the first tensor is a first operand of the second einstein operator, a transposed subscript of the first operand satisfies an order of ABO, AO, AB;
in the case where the first tensor is the second operand of the second einstein operator, the transposed subscript of the second operand satisfies the order ABO, AB, BO:
wherein ABO is a first mark set, elements in ABO being included in both operands of the second Einstein operator and in the output result of the second Einstein operator;
AO is a second mark set, elements in AO being included in the first operand of the second Einstein operator and in the output result of the second Einstein operator;
BO is a third mark set, elements in BO being included in the second operand of the second Einstein operator and in the output result of the second Einstein operator.
10. The apparatus of claim 8 or 9, the transpose of the gradient of the output result of the first einstein operator also being stored in the target storage medium, wherein the executing module is configured to:
in the case where the first tensor is a first operand of the first einstein operator and the second einstein operator is used to determine a gradient of a second operand of the first einstein operator, determining an expression of the second einstein operator as a first target expression, the first target expression being: dO × A -> dB, wherein dO represents the gradient of the output result of the first Einstein operator, A represents the first tensor, dB represents the gradient of the second operand of the first Einstein operator;
taking the read first tensor transposition result as the transposition of A in the first target expression, taking the transposition of the gradient of the output result of the first Einstein operator read from the target storage medium as the transposition of dO in the first target expression, and executing the first target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the second operand of the first Einstein operator.
11. The apparatus of claim 8 or 9, the transpose of the gradient of the output result of the first einstein operator also being stored in the target storage medium, wherein the executing module is configured to:
determining an expression of the second einstein operator as a second target expression if the first tensor is a second operand of the first einstein operator and the second einstein operator is used for determining a gradient of the first operand of the first einstein operator, the second target expression being: B × dO -> dA, where dO represents a gradient of an output result of the first einstein operator, B represents the first tensor, and dA represents a gradient of a first operand of the first einstein operator;
taking the read first tensor transposed result as the transpose of B in the second target expression, taking the transpose of the gradient of the output result of the first Einstein operator read from the target storage medium as the transpose of dO in the second target expression, and executing the second target expression based on the subscript of the first tensor of the second Einstein operator to obtain the gradient of the first operand of the first Einstein operator.
12. The apparatus of any of claims 8-11, further comprising:
a second storage module, configured to store an intermediate gradient in the target storage medium when the second einstein operator performs a reverse operation with respect to the first einstein operator;
a second obtaining module, configured to read the intermediate gradient from the target storage medium, and call a kernel to perform a transposition operation on the intermediate gradient according to a subscript of a corresponding target operand, so as to obtain a target gradient of the target operand;
wherein:
in the case that the target operand is the first operand of the first einstein operator, the gradient of the first operand of the first einstein operator determined by the second einstein operator is the intermediate gradient;
in the case where the target operand is a second operand of a first einstein operator, the gradient of the second operand of the first einstein operator determined by the second einstein operator is the intermediate gradient.
13. The apparatus of any one of claims 8-12, wherein the first einstein operator has two operands, further comprising a split module to:
in the case where a target einstein operator comprises an ordered arrangement of n operands, the target einstein operator is broken down into a plurality of first einstein operators that execute in order based on the following method:
determining the operands ranked first and second among the n operands as the first operand and the second operand, respectively, of the first Einstein operator corresponding to the second-ranked operand, and obtaining an output result of the first Einstein operator corresponding to the second-ranked operand;
in the case where unprocessed operands exist among the n operands, determining the first-ranked unprocessed operand as a target operand; and,
taking the output result of the first Einstein operator corresponding to the operand preceding the target operand, and the target operand itself, as the first operand and the second operand, respectively, of the first Einstein operator corresponding to the target operand, to obtain an output result of the first Einstein operator corresponding to the target operand.
14. The apparatus of any one of claims 8-13, wherein the first einstein operator includes two operands, the first tensor being either of the two operands.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202211488824.9A 2022-11-25 2022-11-25 Data processing method, device, electronic equipment and storage medium Active CN115759294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211488824.9A CN115759294B (en) 2022-11-25 2022-11-25 Data processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211488824.9A CN115759294B (en) 2022-11-25 2022-11-25 Data processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115759294A true CN115759294A (en) 2023-03-07
CN115759294B CN115759294B (en) 2023-10-24

Family

ID=85337922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211488824.9A Active CN115759294B (en) 2022-11-25 2022-11-25 Data processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115759294B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170200094A1 (en) * 2016-01-07 2017-07-13 1026 Labs, Inc. Hardware accelerated machine learning
CN109299725A (en) * 2018-07-27 2019-02-01 华中科技大学鄂州工业技术研究院 A kind of forecasting system and device based on the decomposition of tensor chain Parallel Implementation high-order dominant eigenvalue
US20200409664A1 (en) * 2019-06-27 2020-12-31 Amazon Technologies, Inc. Transpose operations using processing element array
JP2021018677A (en) * 2019-07-22 2021-02-15 株式会社Preferred Networks Information processing system, method for generating neural network structure and information processing program
CN114201242A (en) * 2021-12-10 2022-03-18 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing data
CN114556372A (en) * 2019-09-03 2022-05-27 辉达公司 Processor and system for transforming tensor operations in machine learning
CN114724254A (en) * 2022-05-16 2022-07-08 北京百度网讯科技有限公司 Method, device, equipment, storage medium and program product for determining action category
CN115081607A (en) * 2022-05-19 2022-09-20 北京百度网讯科技有限公司 Reverse calculation method, device and equipment based on embedded operator and storage medium
CN115169541A (en) * 2022-08-17 2022-10-11 无锡江南计算技术研究所 Tensor, vector and scalar calculation acceleration and data scheduling system

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170200094A1 (en) * 2016-01-07 2017-07-13 1026 Labs, Inc. Hardware accelerated machine learning
CN109299725A (en) * 2018-07-27 2019-02-01 华中科技大学鄂州工业技术研究院 A kind of forecasting system and device based on the decomposition of tensor chain Parallel Implementation high-order dominant eigenvalue
US20200409664A1 (en) * 2019-06-27 2020-12-31 Amazon Technologies, Inc. Transpose operations using processing element array
CN114008586A (en) * 2019-06-27 2022-02-01 亚马逊技术股份有限公司 Transpose operation using an array of processing elements
JP2021018677A (en) * 2019-07-22 2021-02-15 株式会社Preferred Networks Information processing system, method for generating neural network structure and information processing program
CN114556372A (en) * 2019-09-03 2022-05-27 辉达公司 Processor and system for transforming tensor operations in machine learning
CN114201242A (en) * 2021-12-10 2022-03-18 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing data
CN114724254A (en) * 2022-05-16 2022-07-08 北京百度网讯科技有限公司 Method, device, equipment, storage medium and program product for determining action category
CN115081607A (en) * 2022-05-19 2022-09-20 北京百度网讯科技有限公司 Reverse calculation method, device and equipment based on embedded operator and storage medium
CN115169541A (en) * 2022-08-17 2022-10-11 无锡江南计算技术研究所 Tensor, vector and scalar calculation acceleration and data scheduling system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ROBERT CIMRMAN: "Fast evaluation of finite element weak forms using python tensor contraction packages", Advances in Engineering Software, pages 1 - 26 *
ZHOU Qi: "FPGA-based tensor decomposition computing unit and its application in face recognition", China Master's Theses Full-text Database, pages 1 - 68 *
HUANG Chun et al.: "Design and implementation of batched matrix multiplication for deep learning", Chinese Journal of Computers, vol. 45, no. 2, pages 225 - 239 *

Also Published As

Publication number Publication date
CN115759294B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN109902186B (en) Method and apparatus for generating neural network
Sun et al. Extracting entities and relations with joint minimum risk training
US10860829B2 (en) Data-parallel parameter estimation of the Latent Dirichlet allocation model by greedy Gibbs sampling
CN114911465B (en) Method, device and equipment for generating operator and storage medium
US20220215177A1 (en) Method and system for processing sentence, and electronic device
US11651198B2 (en) Data processing method and apparatus for neural network
US11900263B2 (en) Augmenting neural networks
WO2023045149A1 (en) Image fusion method and apparatus, electronic device, and storage medium
CN113159013B (en) Paragraph identification method, device, computer equipment and medium based on machine learning
CN114187459A (en) Training method and device of target detection model, electronic equipment and storage medium
US20220101194A1 (en) Method, electronic device, and computer program product for processing machine learning model
CN114201242B (en) Method, device, equipment and storage medium for processing data
CN114495102A (en) Text recognition method, and training method and device of text recognition network
CN113762109B (en) Training method of character positioning model and character positioning method
CN113407610B (en) Information extraction method, information extraction device, electronic equipment and readable storage medium
CN114495101A (en) Text detection method, and training method and device of text detection network
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN115759294A (en) Data processing method and device, electronic equipment and storage medium
CN115809688B (en) Model debugging method and device, electronic equipment and storage medium
CN114792097B (en) Method and device for determining prompt vector of pre-training model and electronic equipment
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium
CN116151374A (en) Distributed model reasoning method, device, equipment, storage medium and program product
CN115457365A (en) Model interpretation method and device, electronic equipment and storage medium
CN114860411A (en) Multitask learning method and device, electronic equipment and storage medium
CN114661904A (en) Method, apparatus, device, storage medium, and program for training document processing model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant