CN116627427A

CN116627427A - Compiling method and compiling device

Info

Publication number: CN116627427A
Application number: CN202310722888.9A
Authority: CN
Inventors: 翟季冬; 郑立言; 王豪杰
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2023-06-16
Filing date: 2023-06-16
Publication date: 2023-08-22

Abstract

A compiling method, a compiling apparatus, an electronic device, a computer-readable storage medium, and a computer program product are disclosed. The method comprises the following steps: acquiring a calculation graph in a tensor program to be compiled, and converting the calculation graph into an initial intermediate representation; determining a plurality of equivalent intermediate representations equivalent to the intermediate representation based on a search space defined by a set of transformation rules; determining a target intermediate representation using the plurality of equivalent intermediate representations based on a distance between an intermediate representation of a preset operator and the equivalent intermediate representation; and determining executable code corresponding to the computational graph based on the target intermediate representation.

Description

Compiling method and compiling device

Technical Field

The present disclosure relates to a compiling method, a compiling apparatus, an electronic device, a computer-readable storage medium, and a computer program product.

Background

In order to generate efficient executable code on a general purpose processor or domain specific artificial intelligence chip, a typical artificial intelligence programming framework represents an artificial intelligence application as a tensor program whose computation process is represented as a computational graph made up of tensors and operators. In the compiling optimization stage, the artificial intelligence application represented by the computational graph performs layer optimization and operator layer optimization respectively, wherein the former finds a more efficient computational graph mainly through equivalent transformation of the computational graph, and the latter mainly generates efficient codes for different hardware.

Layer optimization is mainly divided into manual optimization and automatic optimization. Among them, manual optimization requires system development engineers to have high domain knowledge and take a lot of time to find optimization opportunities. While automatic optimization is too inefficient and requires a significant amount of computational effort to support. Meanwhile, the operator optimization has the problems of long development period, high maintenance cost, difficulty in better supporting a new operator or a custom operator and the like.

Therefore, further improvements in compiling artificial intelligence applications represented by computational graphs are needed to increase the efficiency of compiling and reduce computational effort.

Disclosure of Invention

The embodiment of the disclosure provides a compiling method and device, electronic equipment, a computer readable storage medium and a computer program product.

The embodiment of the disclosure provides a compiling method, which comprises the following steps: acquiring a calculation graph in a tensor program to be compiled, and converting the calculation graph into an initial intermediate representation; determining a plurality of equivalent intermediate representations equivalent to the intermediate representation based on a search space defined by a set of transformation rules; determining a target intermediate representation using the plurality of equivalent intermediate representations based on a distance between an intermediate representation of a preset operator and the equivalent intermediate representation; and determining executable code corresponding to the computational graph based on the target intermediate representation.

The embodiment of the disclosure provides a compiling device, which comprises: the first module is used for acquiring a calculation graph in the tensor program to be compiled and converting the calculation graph into an initial intermediate representation; a second module for determining a plurality of equivalent intermediate representations equivalent to the intermediate representation based on a search space defined by a set of transformation rules; a third module for determining a target intermediate representation using the plurality of equivalent intermediate representations based on a distance between an intermediate representation of a preset operator and the equivalent intermediate representation; and a fourth module for determining executable code corresponding to the computational graph based on the target intermediate representation.

The embodiment of the disclosure provides an electronic device, comprising: a processor; and a memory, wherein the memory stores a computer executable program that, when executed by the processor, performs the method described above.

The disclosed embodiments provide an apparatus comprising: a processor; and a memory storing computer instructions which, when executed by the processor, implement the above-described method.

Embodiments of the present disclosure provide a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.

According to another aspect of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable medium and executes the computer instructions to cause the computer device to perform the aspects described above or methods provided in various alternative implementations of the aspects described above.

The embodiment of the disclosure can fully combine the information of the layers and the operator layers to carry out graph calculation fusion optimization on the compiling process of the tensor program. The disclosed embodiments automatically optimize intermediate representations based on equivalent transformations of intermediate representations and explore a broader intermediate representation search space using algebraic equivalence principles. In addition, the memory layout and algebraic operation in the calculation process are searched respectively, so that the search time of the automatic optimization process is reduced, and the limitation that the memory layout and algebraic operation cannot be searched simultaneously in the prior art is solved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. The drawings in the following description are only exemplary embodiments of the present disclosure.

Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present disclosure.

Fig. 2 is a flowchart illustrating a compiling method according to an embodiment of the disclosure.

Fig. 3 is a schematic diagram illustrating a compiling apparatus according to an embodiment of the disclosure.

Fig. 4 is a schematic table illustrating a subset of a set of transformation rules according to an embodiment of the present disclosure.

Fig. 5 is a diagram illustrating an iterator mapping table according to an embodiment of the present disclosure.

Fig. 6 is a diagram illustrating one example of operation S203 according to an embodiment of the present disclosure.

Fig. 7 is a schematic diagram illustrating pseudo code according to an embodiment of the present disclosure.

Fig. 8 shows a schematic diagram of an electronic device according to an embodiment of the disclosure.

FIG. 9 illustrates an architectural diagram of a computing device according to an embodiment of the present disclosure.

Fig. 10 shows a schematic diagram of a computer-readable storage medium according to an embodiment of the disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.

In the present specification and drawings, steps and elements having substantially the same or similar are denoted by the same or similar reference numerals, and repeated descriptions of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first," "second," and the like are used merely to distinguish the descriptions, and are not to be construed as indicating or implying relative importance or order.

For purposes of describing the present disclosure, the following presents concepts related to the present disclosure.

The scheme provided by the embodiment of the disclosure relates to artificial intelligence and other technologies, and is specifically described by the following embodiment.

First, an application scenario of a method according to an embodiment of the present disclosure and a corresponding apparatus or the like will be described with reference to fig. 1. Fig. 1 shows a schematic diagram of an application scenario 100, in which a server 110 and a plurality of user terminals 120 are schematically shown, according to an embodiment of the present disclosure.

The compiling method of the embodiment of the present disclosure may be integrated in various electronic devices, for example, any electronic device of the server 110 and the plurality of user terminals 120 in fig. 1. For example, a model for processing video data may be integrated in the user terminal 120. The user terminal 120 may be, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal computer (PC, personal Computer), a smart speaker, a smart watch, or the like. For another example, the compiling method of the embodiment of the disclosure may also be integrated in the server 110. The server 110 may be an independent physical server, a server 110 cluster or a distributed system formed by a plurality of physical servers, or a cloud server 110 providing a basic cloud computing service. The terminals and the server 110 may be directly or indirectly connected through wired or wireless communication, and the present disclosure is not limited herein.

Current academia and industry have proposed solutions for automatically optimizing layers. The schemes adopt a traversing and verifying algorithm, namely, firstly, traversing and combining all operators supported in an operator library, then screening operator combinations equivalent to the computational graph to be optimized through verification, and finally, selecting the most efficient equivalent computational graph through an automatic optimization method. The schemes are optimized by taking operators as granularity, and the bottom information such as the computation semantics of the operators is ignored. Meanwhile, because the optimization work does not consider the implementation details of operators and can only search in a given operator set, the optimization space is greatly limited.

In addition, current academia and industry also propose schemes for automatically optimizing operator layers. These schemes abstract the upper computational logic of the operator as "computation", the lower architectural related optimizations as "scheduling", and generate executable code through rich scheduling primitives and autotune techniques. Although the schemes can show the calculation process of operators, the expression method is oriented to the generation process of codes, and due to the complexity and limitation of a scheduling strategy, the work can only optimize a single operator, and a plurality of operators in a calculation graph are difficult to jointly optimize, so that more optimization opportunities are missed.

Based on this disclosure, there is provided a compiling method including: acquiring a calculation graph in a tensor program to be compiled, and converting the calculation graph into an initial intermediate representation; determining a plurality of equivalent intermediate representations equivalent to the intermediate representation based on a search space defined by a set of transformation rules; determining a target intermediate representation using the plurality of equivalent intermediate representations based on a distance between an intermediate representation of a preset operator and the equivalent intermediate representation; and determining executable code corresponding to the computational graph based on the target intermediate representation.

The method and the device fully combine the information of the layers and the operator layers, and perform graph calculation fusion optimization on the compiling process of the tensor program. The disclosed embodiments automatically optimize intermediate representations based on equivalent transformations of intermediate representations and explore a broader intermediate representation search space using algebraic equivalence principles. In addition, the memory layout and algebraic operation in the calculation process are searched respectively, so that the search time of the automatic optimization process is reduced, and the limitation that the memory layout and algebraic operation cannot be searched simultaneously in the prior art is solved.

All or a portion of embodiments in accordance with the present disclosure are described in more detail below in conjunction with fig. 2-9.

Fig. 2 is a flowchart illustrating a compiling method 20 according to an embodiment of the disclosure. Fig. 3 is a schematic diagram illustrating a compiling apparatus according to an embodiment of the disclosure. Fig. 4 is a schematic table illustrating a subset of a set of transformation rules according to an embodiment of the present disclosure. Fig. 5 is a diagram illustrating an iterator mapping table according to an embodiment of the present disclosure.

The compiling method 20 according to the embodiment of the disclosure may be applied to any electronic device. It is understood that the electronic device may be a different kind of hardware device, such as a Personal Digital Assistant (PDA), an audio/video device, a mobile phone, an MP3 player, a personal computer, a laptop computer, a server 110, etc. For example, the electronic device may be the server 110 and the user terminal 120 of fig. 1, etc. Hereinafter, the present disclosure is described by taking the server 110 as an example, and those skilled in the art should understand that the present disclosure is not limited thereto.

In particular, the compiling method 20 according to the embodiment of the present disclosure is suitable for compiling tensor programs. Artificial intelligence applications are typically represented as tensor programs by an artificial intelligence programming framework. Tensors are a form of data storage that is multidimensional, with the dimensions of the data being referred to as the order of the tensors. It can be seen as an extension of vectors and matrices in multidimensional space. For example, a vector may be considered a one-dimensional tensor and a matrix may be considered a two-dimensional tensor. The tensor program is a program constructed based on a data storage format such as tensor. Compilation refers to the process of converting a tensor program into low-level instructions that can be executed on specific hardware. Of course, the present disclosure is not limited thereto.

For example, the method 20 according to an embodiment of the present disclosure includes the following operations S201 to S204. Of course, the present disclosure may include more or fewer operations, and the present disclosure is not limited thereto.

In operation S201, a computational graph in a tensor program to be compiled is acquired and converted into an initial intermediate representation.

Optionally, the computational graph is a data structure containing a set of operators and data units flowing between the operators. Operators are basic computing units for processing tensor data, and are used for realizing common computing logic in various machine learning, including data conversion, condition control, mathematical operation and the like. The tensor program to be compiled may be represented as a combination of multiple computational graphs. Each computational graph represents a subroutine. Of course, the present disclosure is not limited thereto.

Alternatively, as shown in fig. 3, the calculation map is represented as a map composed of nodes and line segments with arrows. Wherein the nodes represent operators, the arrowed line segments represent transfer dependencies between data, and the transferred data is tensor. Of course, the present disclosure is not limited thereto.

Alternatively, a program segmenter may be used to obtain multiple computational graphs from a tensor program to be compiled. Alternatively, the input of the program divider may be a calculation map of the whole tensor program to be compiled, and to accelerate the optimization process, the program divider may further perform the following operations: and dividing the tensor program to be compiled into at least one calculation graph by taking a nonlinear activating operator in the tensor program to be compiled as a dividing point. The nonlinear-enabled operator is, for example, an operator identifier indicating a nonlinear operation such as a Softmax operation. Nonlinear operators (e.g., nonlinear functions or activation functions) typically provide no further optimization opportunities beyond operator fusion, so tensor programs can be partitioned by nonlinear operators.

Alternatively, a program translator, such as that shown in FIG. 3, may also be used to convert the computational graph to an initial intermediate representation. The program translator may perform the following operations: each of the at least one computational graph is converted to an initial intermediate representation based on the each computational graph. Of course, the present disclosure is not limited thereto.

Optionally, the initial intermediate representation is an intermediate representation that can be directly transformed from the computational graph, which has not been optimized. The intermediate representation according to embodiments of the present disclosure may also be referred to as a tensor intermediate representation (intermediate representation of tensor computation) for describing the computational content that the tensor computation program needs to perform. Alternatively, the intermediate representation may describe which calculations should be performed by a symbolic representation similar to a linear algebraic calculation process. In addition to symbols similar to linear algebra, the intermediate representation also includes iterator indicators, iterator variable indicators, and so forth, to facilitate optimization in subsequent compilation processes. Thus, the intermediate representation may be used to symbolically describe the result of the computation without describing the specific manner in which the hardware performs the computation. That is, the current intermediate representation does not yet have to be explicitly compiled into a piece of executable code. In particular, the intermediate representation comprises at least one of: an indicator for indicating a sequential value space for element traversal, an indicator for indicating a value space for an iteration variable, an indicator for indicating a memory variable layout. Thus, the intermediate representation can serve as a bridge between the layers and the operator layers in the tensor program, which is helpful for the subsequent fusion optimization between the layers and the operator layers.

For example, as shown in FIG. 3, an example of an intermediate representation may be T _n (A[n]+B[n]) Wherein A [ n ]]And B [ n ]]Representing two tensors, an]And B [ n ]]N is an indicator for indicating the memory variable layout, which represents the memory layout in an indexed manner to support different memory variable layouts. Operators are represented by the symbol "+" and T is an indicator of the sequential value space used to indicate element traversal. T is also known as a traversal symbol, which can explicitly represent the sequential value space traversed by an element, T _n Is an indicator for indicating the value space of the iteration variable, and is used for explicitly marking the value space of the iteration summation variable to support mapping transformation of the value space. Thereby, the middle represents T _n (A[n]+B[n]) Important characteristics such as traversal sequence, dimension value space, memory layout and the like in the tensor program are represented.

In a more specific alternative embodiment, it is assumed that one computational graph in the tensor program indicates that tensor a and tensor K are convolved to obtain tensor B. Wherein, optionally, tensor a is input image data, and tensor K is convolution kernel. The computational graph may be expressed mathematically as B [ n, f, h, w ] =conv (a, K). The program translator may convert the computational graph into an intermediate representation shown in equation (1).

T _nhwf ∑ _crs A[n,c,h+r,w+s]K[r,s,f,c](1)

As described above, T in equation (1), i.e., the Traversal symbol (Traversal), explicitly represents the sequential value space of the element Traversal. The subscripts n, h, w, f, c, r, s explicitly denote the value space of the iterative sum variable. "n, c, h+r, w+s" in A [ n, c, h+r, w+s ] and "r, s, f, c" in K [ r, s, f, c ] represent memory layouts in an indexed manner to support different memory variable layouts.

Next, with continued reference to fig. 3, operations S202 and S203 may be performed using a mutation optimizer. The abrupt optimizer may be divided into two sub-modules, namely, a calculation transformation rule sub-module and a distance guidance search sub-module. Wherein the calculation transformation rule sub-module block is used to perform operation S202, and the distance guidance search sub-module is used to perform operation S203.

In operation S202, a plurality of equivalent intermediate representations equivalent to the initial intermediate representation are determined based on the search space defined by the set of transformation rules.

Alternatively, the transformation rules are also referred to as computational transformation rules, which define equivalent transformation rules for the intermediate representation. These equivalent transformation rules do not change the computation results of the intermediate representation, but only the computation of the intermediate representation. The transformation rule set is given a series of equivalent transformation rules for the intermediate representation. The transformation rules may be predefined. Fig. 4 shows partially predefined transformation rules that may be used to define the search space of the equivalent intermediate representation. That is, only intermediate representations obtained by equivalent transformation that satisfies the transformation rule can be considered as equivalent intermediate representations.

As shown in fig. 4, the transformation rule set includes a multi-intermediate representation rule and a single intermediate representation rule. Among the multiple intermediate representation rules include, but are not limited to, rules for splitting intermediate representations, rules for merging intermediate representations, and rules for merging intermediate representations.

Single intermediate representation rules include, but are not limited to, rules for summing split operations on intermediate representations, rules for variable substitution operations on intermediate representations, rules for traversal merge operations on intermediate representations, rules for boundary relaxation operations on intermediate representations, rules for boundary tightening operations on intermediate representations.

In particular, the rules for splitting the intermediate representation indicate that one intermediate representation is split into multiple independent intermediate representations. The rule for merging intermediate representations indicates that multiple independent intermediate representations are merged into one intermediate representation. Rules for fusing intermediate representations indicate that multiple dependent intermediate representations are combined into one intermediate representation. Rules for performing a summing splitting operation on intermediate representations indicate that the corresponding summing scope of a particular intermediate representation is split into two equivalent summing scopes. Rules for performing variable replacement operations on intermediate representations indicate equivalent replacement of iteration variables in a particular intermediate representation that are involved in the traversal operation. Rules for performing traversal merge operations on intermediate representations indicate that two scopes of a particular intermediate representation that are involved in the traversal merge operation are merged into one equivalent scope. The rules for performing boundary relaxation operations on intermediate representations indicate that the range of values of the iteration variable for a particular intermediate representation is equivalently expanded. The rule for performing the boundary tightening operation on the intermediate representation indicates that the range of values of the iteration variable for the particular intermediate representation is equivalently narrowed. Of course, the disclosure is not so limited.

Because tensor algebra has equivalence, the equivalent transformation conforming to the transformation rule set does not change the overall equivalence of the tensor program. Thus, searching in the search space may obtain a plurality of intermediate representations equivalent to the initial intermediate representation.

Optionally, operation S202 further includes: based on a maximum search depth, a search is performed in the search space to determine a plurality of equivalent intermediate representations equivalent to the intermediate representation. Where the maximum search depth refers to the number of steps taken from the initial intermediate representation to an equivalent intermediate representation. When the search reaches the maximum search depth, stopping the search, and judging whether the intermediate representation equivalent to the initial intermediate representation exists or not, so that resource waste and efficiency reduction caused by too deep search can be avoided.

Next, in operation S203, a target intermediate representation is determined using the plurality of equivalent intermediate representations based on a distance between an intermediate representation of a preset operator and the equivalent intermediate representation. As described above, operation S203 may alternatively be performed by the distance guidance search sub-module described above.

Alternatively, in operation S202 described above, a number of equivalent intermediate representations may be obtained. These equivalent intermediate representations are not necessarily capable of being compiled into better performing executable code. If these equivalent intermediate representations are compiled one-by-one into executable code, then comparing the computational performance of these executable codes wastes a significant amount of computational resources and time. Therefore, it is necessary to select an equivalent intermediate representation capable of causing the computational performance of the executable code to be improved as a target intermediate representation in operation S203. For this reason, it is desirable to devise efficient search methods to achieve fast finding target intermediate representations that can be compiled into efficient executable code.

Alternatively, these equivalent intermediate representations can be broadly divided into two types, namely, computationally intensive equivalent intermediate representations and computationally non-intensive equivalent intermediate representations. In an embodiment of the present disclosure, an operator library is preset for computationally intensive equivalent intermediate representations. A large number of preset operators are stored in the operator library. These preset operators may correspond to specific intermediate representations and these intermediate representations have been experimentally proven to be able to be translated into better performing executable code.

In order to be able to efficiently match computationally intensive operators to preset operators in an operator library, the similarity between intermediate representations of the preset operators can be measured by calculating the distance between the two intermediate representations. The smaller the distance from the intermediate representation of the preset operator, the closer the equivalent intermediate representation is to the preset intermediate operator.

Alternatively, the distance between any two intermediate representations may be defined as the number of different iteration variables that the two intermediate representations possess, the greater the number of different iteration variables the greater the distance therebetween, the lesser the number of different iteration variables the lesser the distance therebetween. Whereby the calculation of the distance between the intermediate representation of the preset operator and the equivalent intermediate representation comprises: determining a first set of iteration variables in an iterator of a plurality of tensors associated with the preset operator based on an intermediate representation of the preset operator; determining, based on the equivalent intermediate representation, a second set of iteration variables in an iterator of a plurality of tensors related to operators in the equivalent intermediate representation; and calculating a distance between an intermediate representation of the preset operator and the equivalent intermediate representation based on the difference between the first set and the second set. Of course, the present disclosure is not limited thereto.

Alternatively, the iterator map shown in FIG. 5 may be utilized to determine the distance between any two intermediate representations. Each row in the iterator map corresponds to a preset operator that can be extended to support any number of preset operators.

Specifically, referring to fig. 5, preset operators include, but are not limited to, operators for tensor calculation such as convolution (conv), batch matrix multiplication (batch matrix), tensor addition (add), matrix multiplication from a common matrix to a diagonal matrix (G2 BMM), and the like. The second column in the table shown in fig. 5 shows the intermediate representations of these preset operators. The third column in the table shown in fig. 5 shows the iterators for these intermediate representations. The iterator comprises a first input tensor (I ₀ ) Is an iterator of (1), a second input tensor (I ₁ ) Is provided, and an output iterator (O) ₀ ). Of course, the present disclosure is not limited thereto.

Taking the "Add" operator as an example, the iteration variables of the first input tensor, the second input tensor, and the output tensor are (m, n). Thus, the first column of the iterator, which represents that the iteration variables for which all three exist, include m and n. The second column of the iterator, which indicates that only the iteration variables present in the first input tensor and the output tensor should be filled in as null. Similarly, the third column of the iterator, which indicates that only the iteration variables present in the second tensor and the output tensor should be filled in as empty. The fourth column of the iterator, which represents that the variables present in the first input tensor and the second input tensor only are null. Of course, the present disclosure is not limited thereto.

Taking batch matrix multiplication (BatchMatmul) operator as an example. The first column of the iterator is filled in with b, i.e. b this variable will be one of the iteration variables of the first input tensor and will be one of the iteration variables of the second input tensor and one of the iteration variables of the output tensor. The second column of the iterator is filled in with m, i.e. m this variable will only be one of the iteration variables of the first input tensor and one of the iteration variables of the output tensor, and will not appear in the second input tensor. The third column of the iterator is filled in with n, i.e. n this variable will only be one of the iteration variables of the second input tensor and one of the iteration variables of the output tensor, and will not appear in the first input tensor. The fourth column of the iterator is filled in with k, i.e. k this variable will only be one of the iteration variables of the first input tensor and one of the iteration variables of the second input tensor and will not appear in the output tensor.

Thus, the distance between the intermediate representation of the preset operator and the equivalent intermediate representation can be calculated based on the iterator shown in the above manner. Specifically, in a similar manner to that of fig. 5, an iterator corresponding to the equivalent intermediate representation may be determined, and then each column of the iterator is compared with an iterator of a preset operator, and if the iterator and the iterator have more values of the same column, the distance between the iterator and the iterator is smaller. Thus, the related information of a part of the content of the target intermediate representation can be directly and automatically generated according to the intermediate representation of the preset operator. Of course, the present disclosure is not limited thereto.

In at least one embodiment of the present disclosure, in addition to selecting one equivalent intermediate representation as the target intermediate representation directly from the equivalent intermediate representations, further searches may be performed in the search space based on these equivalent intermediate representations, and transformation rules that enable the transformed equivalent intermediate representations to be reduced in distance from the preset operator are heuristically selected during the search to perform further equivalent transformations on these equivalent intermediate representations until an equivalent intermediate representation that is closer to the intermediate representation of the preset operator is found. A specific example of operation S203 will be further described later with reference to fig. 6.

Specifically, in this embodiment of the present disclosure, operation S203 includes: for each equivalent intermediate representation of the plurality of equivalent intermediate representations, searching in the search space for a transformation rule that enables a distance between the transformed equivalent intermediate representation and an intermediate representation of a preset operator to be reduced; performing equivalent transformation on the equivalent intermediate representation by using the searched transformation rule to obtain an updated equivalent intermediate representation; and selecting the target intermediate representation from the updated plurality of equivalent intermediate representations. Of course, the present disclosure is not limited thereto.

In at least one embodiment of the present disclosure, there may also be some operators in the equivalent intermediate representation that are not in the operator library. These operators will be denoted as quasi operators. These quasi-operators will then generate their corresponding kernels through an automated compilation framework. With continued reference to FIG. 3, an intermediate representation of the target is shown in the form of a computational graph. Through operations S202 and S203 described above, the computation graph corresponding to the target intermediate representation may include quasi-operators and operators. Of course, the present disclosure is not limited thereto.

Next, in operation S204, executable code corresponding to the computation graph is determined based on the target intermediate representation. The optional operation S204 may be performed by the overall post-optimization module in fig. 3.

Optionally, operation S204 includes: directly converting a part matched with a preset operator in the target intermediate representation into an executable code; and converting the part, which is not matched with the preset operator, of the target intermediate representation into executable codes through a code generation framework.

In particular, since the intermediate representations of the partial memory-intensive operators have been optimized in the target intermediate representation to the intermediate representations corresponding to the preset operators, and these intermediate representations of the memory-intensive operators have been proven to be convertible into better performing executable code. Thus, the portion of the target intermediate representation that matches the preset operator (i.e., the operator portion in FIG. 3) may be directly converted to executable code that has proven to be superior.

Whereas for the quasi-operators (i.e. the parts of the target intermediate representation that do not match the preset operators), these quasi-operators can be fed directly into the code generation framework, since the target intermediate representation can in fact exactly define the computation speech. The code generation framework converts the iterated variables of the iterator into tensors and automatically optimizes these quasi-operators. Alternatively, these quasi-operators are not highly memory intensive operators, so even using code generation frameworks to automatically optimize these quasi-operators does not require too much computational resources. Of course, the present disclosure is not limited thereto.

Furthermore, as described above, the tensor program may be divided into multiple computational graphs and each computational graph may be converted into an intermediate representation of the object. In this case, operation S204 further includes: and splicing the target intermediate representations corresponding to the plurality of calculation graphs to determine the calculation graph corresponding to the tensor program. Therefore, the overall optimization of the target intermediate representations corresponding to the multiple calculation graphs is realized, and executable codes corresponding to the tensor program are determined based on the overall optimized target intermediate representations.

Next, some implementation details of operation S203 of the present disclosure are further described with reference to fig. 6.

As described above, determining the target intermediate representation may be difficult. Example (a) in fig. 6 is one equivalent intermediate representation, while example (b) is the intermediate representation corresponding to the batch matrix multiplication (batch matmul) operator (which is a preset operator). The iterator map shown in fig. 5 may be utilized to determine whether the equivalent intermediate representation shown in example (a) can match the intermediate representation of the preset operator shown in example (b) above.

First, to determine the mapping relationship of example (a) and example (b), all possible one-to-one mappings between input/output tensors of example (a) and example (b) may be enumerated. For example, if the input/output tensor of example (a) were to be mapped to the input/output tensor shown in example (b), then two possible tensor mappings might be enumerated, namely: the first tensor map { A→X, B→Y } and the second tensor map { A→Y, B→X }. The first tensor mapping means that tensor a is mapped to tensor X and tensor B is mapped to tensor Y. The second tensor mapping means that tensor a is mapped to tensor Y and tensor B is mapped to tensor B.

Next, an iterator map as shown in fig. 5 may be utilized to further determine whether the iterators of the first and second tensor maps are equivalent. For example, referring to FIG. 6, for the first tensor map { A→X, B→Y }, the iterator { u, v, X, w } of example (B) is equivalent to the iterator { B, m, k, n } of example (a) from the iterator map shown in FIG. 5. If there are multiple iterators, all possible mappings between the iterators may be enumerated.

The tensor representation includes attribute information (e.g., data layout information of the input tensor) on how to calculate the tensor. To match these attribute information, the input and output tensors can be remodelled into one-dimensional tensors to hide the complexity of the tensor shape. Then, the attribute information is matched by checking the variable coefficients of the one-dimensional tensor. Examples (c) and (d) represent one-dimensional tensors corresponding to tensors B and Y, respectively. To determine if the two are equivalent, it is necessary to determine l _B0 Whether or not equal to l _Y0 +l _Y1 And l _B1 Whether or not equal to l _Y2 Wherein l _Yn Is the n-dimensional step size of the tensor Y. Further, it is also necessary to determine whether the coefficient of w in example (d) is equal to the coefficient of n in example (c). If both are equal, then tensor B and tensor Y are explained to be equivalent. Correspondingly, it is also necessary to determine whether tensor a and tensor X are equivalent.

Through the above three steps, it can be determined whether the equivalent intermediate representation shown in example (a) can match the intermediate representation of the preset operator shown in example (b) above.

Next, some implementation details of specific embodiments of the present disclosure are further described with reference to fig. 7. Fig. 7 is a schematic diagram illustrating pseudo code according to an embodiment of the present disclosure.

As shown in fig. 7, embodiments of the present disclosure are capable of optimizing the workflow of compiling an input tensor program in an end-to-end manner.

Specifically, an input tensor program P and a transformation rule set IR are given. The purpose of the embodiments of the present disclosure is to optimize the input tensor program P to a tensor program P _opt 。

As shown in line 5 of the pseudo code, for the input tensor program P, embodiments of the present disclosure first use non-nativeThe linear activation operator splits it as a split point into a number of subroutines, which constitute a set of subroutines SP, each corresponding to a computational graph. As described in operation S201, each computational graph is converted into an initial intermediate representation E ₀ 。

Lines 7 to 13 of the pseudo code correspond to operations S202 to S203 described above. Specifically, in operation S202, a plurality of equivalent intermediate representations equivalent to the intermediate representation will be determined based only on the search space defined by the multiple intermediate representation rules in the transformation rule set. Then, in operation S203, for each equivalent intermediate representation of the plurality of equivalent intermediate representations, a transformation rule enabling a reduction in a distance between the transformed equivalent intermediate representation and an intermediate representation of a preset operator is searched in a search space defined based on a single intermediate representation rule of a transformation rule set; performing equivalent transformation on the equivalent intermediate representation by using the searched transformation rule to obtain an updated equivalent intermediate representation; and selecting the target intermediate representation from the updated plurality of equivalent intermediate representations.

First, as shown in line 7 of the pseudo code, embodiments of the present disclosure will represent E for each initial intermediate representation ₀ And respectively optimizing. In line 8 of the pseudo code, a search queue Q is constructed and the initial intermediate representation E ₀ And storing the data into a queue. Since the computational graph may contain multiple operators, the initial intermediate representation E ₀ Multiple operators are also included. In line 11 of the pseudo code, embodiments of the present disclosure generate an equivalent intermediate representation based on the transformation rules between operators for each intermediate representation E in Q, and add the generated equivalent intermediate representation to the search queue. Meanwhile, in line 12 of the pseudo code, each intermediate representation E is provided to a search based on intermediate representation distance, an equivalent intermediate representation is generated by a transformation rule within the operator, and the generated equivalent intermediate representation is added to the search queue. In 13 lines of pseudo code, for each computational graph, embodiments of the present disclosure select the best performing intermediate representation in the search state Q as the target intermediate representation and add it to the target intermediate representation set P _opt 。

The pseudo code line 14 corresponds to the operation S204 described above. Embodiments of the present disclosure represent the set P for the target intermediate _opt Post-processing (e.g., operator fusion, etc.) is performed to obtain executable code.

Particular embodiments of the present disclosure may prioritize the conversion of intermediate representations that may be mapped to intermediate representations of preset operators, rather than optimally integrating inter-operator transformation rules and intra-operator transformation rules into a unified search space to perform joint searches. Thus, embodiments of the present disclosure can find a promising transition as early as possible and prune unnecessary search states according to the execution time of the transition result.

According to still another aspect of the present disclosure, there is also provided a compiling apparatus including: the first module is used for acquiring a calculation graph in the tensor program to be compiled and converting the calculation graph into an initial intermediate representation; a second module for determining a plurality of equivalent intermediate representations equivalent to the intermediate representation based on a search space defined by a set of transformation rules; a third module for determining a target intermediate representation using the plurality of equivalent intermediate representations based on a distance between an intermediate representation of a preset operator and the equivalent intermediate representation; and a fourth module for determining executable code corresponding to the computational graph based on the target intermediate representation.

According to yet another aspect of the present disclosure, there is also provided an electronic device for implementing the method 20 according to an embodiment of the present disclosure or carrying the apparatus according to an embodiment of the present disclosure. Fig. 8 shows a schematic diagram of an electronic device 2000 in accordance with an embodiment of the present disclosure.

As shown in fig. 8, the electronic device 2000 may include one or more processors 2010, and one or more memories 2020. Wherein said memory 2020 has stored therein computer readable code which, when executed by said one or more processors 2010, can perform a compiling method as described above.

The processor in embodiments of the present disclosure may be an integrated circuit chip having signal processing capabilities. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. The methods, operations, and logic blocks of the disclosure in the embodiments of the present disclosure may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, and may be of the X86 architecture or ARM architecture.

In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

For example, a method or apparatus according to embodiments of the present disclosure may also be implemented by means of the architecture of computing device 3000 shown in fig. 8. As shown in fig. 9, computing device 3000 may include a bus 3010, one or more CPUs 3020, a Read Only Memory (ROM) 3030, a Random Access Memory (RAM) 3040, a communication port 3050 connected to a network, an input/output component 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as a ROM 3030 or hard disk 3070, may store various data or files for processing and/or communication of the methods provided by the present disclosure and program instructions for execution by the CPU. The computing device 3000 may also include a user interface 3080. Of course, the architecture shown in FIG. 9 is merely exemplary, and one or more components of the computing device shown in FIG. 9 may be omitted as may be practical in implementing different devices.

According to yet another aspect of the present disclosure, a computer-readable storage medium is also provided. Fig. 10 shows a schematic diagram of a storage medium 4000 according to the present disclosure.

As shown in fig. 10, the computer storage medium 4020 has stored thereon computer readable instructions 4010. When the computer readable instructions 4010 are executed by a processor, a method according to an embodiment of the disclosure described with reference to the above figures may be performed. The computer readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (ddr SDRAM), enhanced Synchronous Dynamic Random Access Memory (ESDRAM), synchronous Link Dynamic Random Access Memory (SLDRAM), and direct memory bus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory. It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

The disclosed embodiments also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform a method according to an embodiment of the present disclosure.

It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The exemplary embodiments of the present disclosure described in detail above are illustrative only and are not limiting. Those skilled in the art will understand that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and such modifications should fall within the scope of the disclosure.

Claims

1. A compiling method, comprising:

acquiring a calculation graph in a tensor program to be compiled, and converting the calculation graph into an initial intermediate representation;

Determining a plurality of equivalent intermediate representations equivalent to the intermediate representation based on a search space defined by a set of transformation rules;

determining a target intermediate representation using the plurality of equivalent intermediate representations based on a distance between an intermediate representation of a preset operator and the equivalent intermediate representation; and

and determining executable codes corresponding to the calculation graph based on the target intermediate representation.

2. The compiling method of claim 1, wherein the acquiring a computational graph in a tensor program to be compiled and converting the computational graph into an initial intermediate representation comprises:

dividing the tensor program to be compiled into at least one calculation graph by taking a nonlinear activating operator in the tensor program to be compiled as a dividing point; and

each of the at least one computational graph is converted to an initial intermediate representation based on the each computational graph.

3. The compiling method of claim 1, wherein the intermediate representation is used for describing a calculation content that the tensor calculation program needs to execute, the intermediate representation including at least one of: an indicator for indicating a sequential value space for element traversal, an indicator for indicating a value space for an iteration variable, an indicator for indicating a memory variable layout.

4. The compilation method of claim 1, wherein the transformation rule set comprises a multi-intermediate representation rule and a single intermediate representation rule, wherein,

the multi-intermediate representation rule includes at least one of: rules for splitting the intermediate representation, rules for merging the intermediate representations, and rules for merging the intermediate representations;

the single intermediate representation rule is at least one of: a rule for summing and splitting the intermediate representation, a rule for variable substitution of the intermediate representation, a rule for traversing and merging the intermediate representation, a rule for boundary relaxation of the intermediate representation, and a rule for boundary tightening of the intermediate representation.

5. The compilation method of claim 1, wherein the determining a plurality of equivalent intermediate representations equivalent to the intermediate representation based on a search space defined by a set of transformation rules comprises:

based on a maximum search depth, a search is performed in the search space to determine a plurality of equivalent intermediate representations equivalent to the intermediate representation.

6. The compiling method according to claim 1, wherein the calculating of the distance between the intermediate representation of the preset operator and the equivalent intermediate representation comprises:

Determining a first set of iteration variables in an iterator of a plurality of tensors associated with the preset operator based on an intermediate representation of the preset operator;

determining, based on the equivalent intermediate representation, a second set of iteration variables in an iterator of a plurality of tensors related to operators in the equivalent intermediate representation; and

based on the difference between the first set and the second set, a distance between an intermediate representation of the preset operator and the equivalent intermediate representation is calculated.

7. The compiling method according to claim 1, wherein the determining the target intermediate representation using the plurality of equivalent intermediate representations based on a distance between the intermediate representation of the preset operator and the equivalent intermediate representation comprises:

for each equivalent intermediate representation of the plurality of equivalent intermediate representations,

searching the search space for a transformation rule capable of reducing a distance between the transformed equivalent intermediate representation and an intermediate representation of a preset operator;

performing equivalent transformation on the equivalent intermediate representation by using the searched transformation rule to obtain an updated equivalent intermediate representation; and

the target intermediate representation is selected from the updated plurality of equivalent intermediate representations.

8. The compiling method of claim 1, wherein the determining executable code corresponding to the computational graph based on the target intermediate representation comprises:

directly converting a part matched with a preset operator in the target intermediate representation into an executable code so as to call the preset operator; and

and converting the part, which is not matched with the preset operator, of the target intermediate representation into an executable code through a code generation framework.

9. The compiling method of claim 2, wherein the determining executable code corresponding to the computational graph based on the target intermediate representation comprises:

and splicing the target intermediate representations corresponding to the plurality of calculation graphs to determine the calculation graph corresponding to the tensor program.

10. The method of claim 4, wherein the determining a plurality of equivalent intermediate representations equivalent to the intermediate representation based on a search space defined by a set of transformation rules comprises:

a plurality of equivalent intermediate representations equivalent to the intermediate representation are determined based on a search space defined by a plurality of intermediate representation rules in a set of transformation rules.

11. The method of claim 10, wherein determining the target intermediate representation using the plurality of equivalent intermediate representations based on a distance between the intermediate representation of the preset operator and the equivalent intermediate representation comprises:

searching for transformation rules capable of reducing a distance between the transformed equivalent intermediate representation and an intermediate representation of a preset operator in a search space defined based on a single intermediate representation rule in a set of transformation rules;

12. A compiling apparatus, comprising:

the first module is used for acquiring a calculation graph in the tensor program to be compiled and converting the calculation graph into an initial intermediate representation;

a second module for determining a plurality of equivalent intermediate representations equivalent to the intermediate representation based on a search space defined by a set of transformation rules;

a third module for determining a target intermediate representation using the plurality of equivalent intermediate representations based on a distance between an intermediate representation of a preset operator and the equivalent intermediate representation; and

and a fourth module, configured to determine an executable code corresponding to the computation graph based on the target intermediate representation.

13. An electronic device, comprising:

a processor; and

a memory, wherein the memory has stored therein a computer executable program which, when executed by the processor, performs the compiling method of any one of claims 1 to 11.

14. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the compiling method of any one of claims 1 to 11.

15. A computer program product comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer readable medium, the processor executing the computer instructions, causing the computer device to perform the compiling method of any of claims 1-11.