CN117008916A - Tensor program optimization method and device - Google Patents

Tensor program optimization method and device

Info

Publication number
CN117008916A
Authority
CN
China
Prior art keywords
graph
intermediate representation
operator
performance
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310827561.8A
Other languages
Chinese (zh)
Inventor
翟季冬 (Zhai Jidong)
马子轩 (Ma Zixuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202310827561.8A
Publication of CN117008916A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/40: Transformation of program code
    • G06F 8/41: Compilation
    • G06F 8/44: Encoding
    • G06F 8/443: Optimisation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention provides a tensor program optimization method and device in the technical field of deep learning. The method comprises the following steps: acquiring the computational graph corresponding to a tensor program to be optimized; using a performance model to identify the compute-bound operators and the memory-bound operators in the computational graph; invoking a hardware operator library to implement the compute-bound operators, generating first code; optimizing the memory-bound operators by using a graph intermediate representation, generating second code, wherein edges in the graph intermediate representation represent data slices on a specified level of the memory hierarchy and nodes represent one or a group of hardware instructions for performing a specified operation; and merging the first code and the second code to generate the optimized tensor program. The tensor program optimization method and device provided by the invention optimize memory performance more effectively than the prior art.

Description

Tensor program optimization method and device
Technical Field
The invention relates to the technical field of deep learning, and in particular to a tensor program optimization method and device based on an instruction-level graph intermediate representation with explicit data movement.
Background
Memory performance is increasingly the bottleneck of deep neural network applications, and a great deal of effort has been devoted to solving this problem. Tensor compilers are a common class of solutions: they represent the tensor program as an intermediate representation and optimize on that representation to improve execution efficiency. Depending on the abstraction level of the intermediate representation, related work falls mainly into two classes: one targets the code organization inside individual operators, known as tensor compilers for operators; the other targets data reuse and operator fusion across operators, known as optimization based on computational-graph transformation.
Existing operator-level tensor compilers such as TVM and Ansor adopt the Halide abstraction, describing a program as separate compute and schedule parts. By searching for suitable schedule combinations, they find an appropriate layout and execution mode for each compute operator. This mode has the following drawbacks:
1. Existing tensor compilers lack an explicit description of data-movement operations and therefore cannot optimize memory operations directly. In TVM, for example, the intermediate representation is loop-based; a program must first be transformed equivalently and then mapped onto the different memory levels, so the optimization process can hardly target memory performance specifically.
2. The search space of existing tensor compilers is not large enough to cover the full variety of memory operations. In Ansor, for example, operators must be fused before the search begins. This ordering prevents the search from exploring all possible combinations of memory-operation parameters, so the optimal combination of operations may never be found, resulting in inefficient memory usage.
The other class of work, aimed at operator fusion, such as TensorRT and DNNFusion, identifies fusion opportunities between operators in the computational graph. By analyzing the data dependencies between operators, several compute operators are merged into one, reducing the overall memory traffic. This class still has the following drawbacks:
1. Computational-graph optimization is limited by the back-end operators. Whether based on an operator library (e.g., TensorRT) or on code generation (e.g., DNNFusion), such work requires fusion rules, i.e., the optimizer must know in advance which operator combinations can be replaced by which new operators. The optimization process is therefore restricted to the computation patterns the back end supports and is hard to extend to new models; likewise, it is hard to migrate to other hardware platforms. Such work is therefore difficult to extend.
2. The granularity of the data-dependency analysis in such work is too coarse: only reuse relationships between whole tensors can be analyzed. Data reuse across a complex memory hierarchy typically happens at a finer granularity and requires a finer-grained dependency analysis, so it is difficult for such work to exploit the upper levels of the memory hierarchy to improve program performance.
Disclosure of Invention
In view of the above, the present invention provides a tensor program optimization method and apparatus to solve at least one of the above-mentioned problems.
In order to achieve the above purpose, the present invention adopts the following scheme:
according to a first aspect of the present invention, there is provided a tensor program optimization method, the method comprising: acquiring the computational graph corresponding to a tensor program to be optimized; obtaining the compute-bound operators and the memory-bound operators in the computational graph by using a performance model; invoking a hardware operator library to implement the compute-bound operators, generating first code; optimizing the memory-bound operators by using a graph intermediate representation, generating second code, wherein edges in the graph intermediate representation represent data slices on a specified level of the memory hierarchy, and nodes in the graph intermediate representation represent one or a group of hardware instructions for performing a specified operation; and merging the first code and the second code to generate an optimized tensor program.
As one embodiment of the present invention, obtaining the compute-bound operators and the memory-bound operators in the computational graph by using a performance model includes: predicting the computation time and the memory-access time of each operator in the computational graph with the performance model; and determining, based on the computation time and the memory-access time, whether each operator is compute-bound or memory-bound.
As an embodiment of the present invention, the specified operation in the above method includes: a data-movement operation, a compute operation, a synchronization operation, or a dimension-transformation operation.
As one embodiment of the present invention, in the above method, optimizing the memory-bound operators using the graph intermediate representation and generating the second code includes: forming all memory-bound operators into several subgraphs without external dependencies; for each operator in each subgraph, acquiring all graph intermediate representations that can implement the operator; combining the graph intermediate representations of all operators in each subgraph to obtain the graph-intermediate-representation combinations corresponding to the subgraph; performing graph rewriting on the combinations based on preset rules to obtain optimized combinations; selecting, with a performance model, several candidate combinations from the optimized combinations; and generating the second code based on the several candidate combinations of each subgraph.
As an embodiment of the present invention, the preset rules in the above method include a first rule, a second rule, and a third rule, where: the first rule is: when multiple data-movement operations access the same data slice, add a synchronization operation whose scope corresponds to the scope of the movement operations; the second rule is: replace a write-synchronize-read sequence with a single synchronization operation, and collapse redundant synchronize-read sequences into one; the third rule is: operations that do not affect memory performance are forcibly swapped with predecessor operations with which they commute.
As an embodiment of the present invention, in the above method, performing graph rewriting on the combinations based on the preset rules to obtain the optimized combinations includes: applying the first rule, the third rule, and the second rule in turn, repeatedly, until none of the rules can optimize the combination any further, thereby obtaining the optimized graph-intermediate-representation combination.
As an embodiment of the present invention, generating the second code based on the several candidate graph-intermediate-representation combinations of each subgraph in the above method includes: generating a kernel program for each graph intermediate representation in the candidate combinations; selecting, for each graph intermediate representation, a topological order that minimizes the total size of data slices that are live at the same time; generating the operation instructions in the kernel program from the nodes of each graph intermediate representation in that topological order; mapping each graph intermediate representation onto the specific hardware and lowering the operation instructions to instructions of that hardware, thereby forming the code of each candidate combination; and performance-testing the code of the candidate combinations of each subgraph and selecting the best code of each subgraph as the second code.
According to a second aspect of the present invention, there is provided a tensor program optimization apparatus, the apparatus comprising: a computational-graph acquisition unit for acquiring the computational graph corresponding to a tensor program to be optimized; an operator classification unit for obtaining the compute-bound operators and the memory-bound operators in the computational graph by using a performance model; a compute-performance optimization unit for invoking a hardware operator library to implement the compute-bound operators and generate first code; a memory-access-performance optimization unit for optimizing the memory-bound operators by using a graph intermediate representation and generating second code, wherein edges in the graph intermediate representation represent data slices on a specified level of the memory hierarchy and nodes represent one or a group of hardware instructions for performing a specified operation; and a code merging unit for merging the first code and the second code to generate an optimized tensor program.
As an embodiment of the present invention, the operator classification unit in the above apparatus includes: a performance computation module for predicting the computation time and the memory-access time of each operator in the computational graph with the performance model; and an operator classification module for determining, based on the computation time and the memory-access time, whether each operator is compute-bound or memory-bound.
As an embodiment of the present invention, the specified operation in the above apparatus includes: a data-movement operation, a compute operation, a synchronization operation, or a dimension-transformation operation.
As an embodiment of the present invention, the memory-access-performance optimization unit in the above apparatus includes: a subgraph acquisition module for forming all memory-bound operators into several subgraphs without external dependencies; an intermediate-representation acquisition module for acquiring, for each operator in each subgraph, all graph intermediate representations that can implement the operator; a combination module for combining the graph intermediate representations of all operators in each subgraph to obtain the corresponding graph-intermediate-representation combinations of the subgraph; a graph rewriting module for rewriting the combinations based on preset rules to obtain optimized combinations; a candidate-combination selection module for selecting, with a performance model, several candidate combinations from the optimized combinations; and a second code generation module for generating the second code based on the several candidate combinations of each subgraph.
As an embodiment of the present invention, the preset rules in the above apparatus include a first rule, a second rule, and a third rule, where: the first rule is: when multiple data-movement operations access the same data slice, add a synchronization operation whose scope corresponds to the scope of the movement operations; the second rule is: replace a write-synchronize-read sequence with a single synchronization operation, and collapse redundant synchronize-read sequences into one; the third rule is: operations that do not affect memory performance are forcibly swapped with predecessor operations with which they commute.
As an embodiment of the present invention, the graph rewriting module in the above apparatus is specifically configured to apply the first rule, the third rule, and the second rule in turn, repeatedly, until none of the rules can optimize the combination any further, thereby obtaining the optimized graph-intermediate-representation combination.
As an embodiment of the present invention, the second code generation module in the above apparatus includes: a kernel generation sub-module for generating a kernel program for each graph intermediate representation in the candidate combinations; a topological-order selection sub-module for selecting, for each graph intermediate representation, a topological order that minimizes the total size of data slices that are live at the same time; an operation-instruction generation sub-module for generating the operation instructions in the kernel program from the nodes of each graph intermediate representation in that topological order; a mapping sub-module for mapping each graph intermediate representation onto the specific hardware and lowering the operation instructions to instructions of that hardware, forming the code of each candidate combination; and a performance-test sub-module for performance-testing the code of the candidate combinations of each subgraph and selecting the best code of each subgraph as the second code.
According to a third aspect of the present application, there is provided an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.
According to a fourth aspect of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
As can be seen from the above technical solutions, the tensor program optimization method and apparatus provided by the present application optimize memory performance more effectively than the prior art. Because edges in the graph intermediate representation represent data slices on a specified level of the memory hierarchy and nodes represent one or a group of hardware instructions performing a specified operation, this instruction-level graph intermediate representation with explicit data movement also brings the following beneficial effects:
1. Data-movement operations are represented explicitly, so optimizations on the graph intermediate representation directly affect the memory performance of the program, which facilitates the subsequent search and optimization.
2. With the graph-based intermediate representation, dependencies between data slices are represented clearly. On this basis, the scope of each data slice can easily be analyzed to decide on which level of the memory hierarchy it should be stored, thereby improving program performance.
3. Instruction-level description is finer-grained than operator-level description. The graph level and the operator level can be optimized jointly to generate efficient tensor-program code, covering a larger search space in which more refined implementations can be found.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
FIG. 1 is a schematic flow chart of a tensor program optimization method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a computational graph provided by an embodiment of the present application;
FIG. 3 is a schematic flow chart of classifying operators by performance limitation using a performance model, provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a GIR provided by an embodiment of the present application;
FIG. 5 is a schematic flow chart of optimizing memory-bound operators using the GIR according to an embodiment of the present application;
FIG. 6 is a code listing of the optimization process for memory-bound operators provided by an embodiment of the present application;
FIG. 7 is a schematic flow chart of generating the second code from the candidate GIR combinations, provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the scope-analysis method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the process of generating GIRs from a computational graph according to an embodiment of the present application;
FIG. 10 is a schematic view of a GIR fragment according to an embodiment of the present application;
FIG. 11 is a schematic diagram of the optimization process performed by the system of an embodiment of the present application on the attention module of GPT-2 during generation;
FIG. 12 is a schematic diagram comparing the optimization effect of the present application with that of the prior art, provided by an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a tensor program optimization device according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of the operator classification unit provided by an embodiment of the present application;
FIG. 15 is a schematic structural diagram of the memory-access-performance optimization unit according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of the second code generation module according to an embodiment of the present application;
FIG. 17 is a schematic block diagram of the system configuration of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present application and their descriptions herein are for the purpose of explaining the present application, but are not to be construed as limiting the application.
Fig. 1 is a schematic flow chart of a tensor program optimization method according to an embodiment of the present application. The embodiment is described from the side of a tensor program optimization system, and the method includes the following steps:
step S101: and obtaining a calculation graph corresponding to the tensor program to be optimized.
A computational graph is a data structure, made up of a set of nodes and edges, used to represent a computation: the nodes represent operators and the edges represent data flows. Fig. 2 is a schematic diagram of a computational graph containing four operators: a rectified linear unit (ReLU), a concatenation operation (Concat), a matrix transpose (Transpose), and a split (Split).
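For illustration only, a minimal Python sketch of such a graph structure is given below; the class and function names, and the chain topology used here, are our own assumptions rather than part of the disclosed embodiment.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # eq=False keeps identity hashing so operators can be set members
class Operator:
    """A node of the computational graph; edges are kept as adjacency lists."""
    name: str
    op_type: str
    inputs: list = field(default_factory=list)    # upstream operators
    outputs: list = field(default_factory=list)   # downstream operators

def connect(src: Operator, dst: Operator) -> None:
    """Add a data-flow edge src -> dst."""
    src.outputs.append(dst)
    dst.inputs.append(src)

# The four operators of FIG. 2, wired as a simple chain for illustration
# (the actual topology in the figure may differ):
relu = Operator("relu0", "ReLU")
concat = Operator("concat0", "Concat")
transpose = Operator("transpose0", "Transpose")
split = Operator("split0", "Split")
connect(relu, concat)
connect(concat, transpose)
connect(transpose, split)
```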
Step S102: obtain the compute-bound operators and the memory-bound operators in the computational graph by using a performance model.
The performance model of this embodiment classifies all operators in the input computational graph as compute-bound or memory-bound according to the target hardware platform. The performance of a compute-bound operator is mainly determined by its amount of computation and thus by the compute performance of the hardware; most matrix multiplications and two-dimensional convolutions are examples. The performance of a memory-bound operator is mainly determined by its memory traffic and the memory-access performance of the hardware; pointwise computation, transpose, and reduction operations are examples.
Preferably, as shown in fig. 3, this step may further comprise the sub-steps of:
step S1021: and predicting the calculation time and the access time of each operator in the calculation graph by using a performance model.
Step S1022: determine, based on the computation time and the memory-access time, whether each operator is compute-bound or memory-bound. Specifically, for example, the amount of computation and the amount of memory access of the program can be derived from the times predicted by the performance model; the ratio of computation to memory access is compared with a preset threshold, and when the ratio is above the threshold, the operator is considered compute-bound, otherwise memory-bound.
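A minimal sketch of this classification step in Python, assuming a performance model that exposes compute_time and memory_time hooks (both names are placeholders, not an API from the patent):

```python
def classify_operators(operators, perf_model, threshold=1.0):
    """Split operators into compute-bound and memory-bound lists.

    perf_model.compute_time(op) and perf_model.memory_time(op) are assumed
    hooks that predict, for the target hardware, the time the operator
    spends on arithmetic and on memory traffic respectively.
    """
    compute_bound, memory_bound = [], []
    for op in operators:
        t_compute = perf_model.compute_time(op)
        t_memory = perf_model.memory_time(op)
        # An operator dominated by arithmetic is compute-bound; otherwise
        # its performance is limited by memory access.
        if t_compute > threshold * t_memory:
            compute_bound.append(op)
        else:
            memory_bound.append(op)
    return compute_bound, memory_bound
```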
Step S103: invoke a hardware operator library to implement the compute-bound operators, generating the first code.
Step S104: optimize the memory-bound operators using a graph intermediate representation, wherein edges in the graph intermediate representation represent data slices on a specified level of the memory hierarchy, and nodes represent one or a set of hardware instructions for performing specified operations.
The solution of this embodiment mainly optimizes the memory performance of tensor programs, and the optimization is based on the above graph intermediate representation, i.e., an instruction-level graph intermediate representation with an explicit data-movement description, abbreviated hereinafter as GIR (Instruction-Level Graph Intermediate Representation with Explicit Data Movement Description).
Fig. 4 is a schematic diagram of a GIR according to an embodiment of the present application, showing both a complete GIR and a simplified GIR. The edges (directed arrows) each represent a data slice on a specific memory level; for example, D0 in fig. 4 represents a P×Q data slice in DRAM. The nodes of the GIR represent one or a set of hardware instructions performing a specified operation, such as the MOVE and RELU operations in fig. 4. Preferably, besides movement and compute operations, the specified operations here may also include synchronization operations and dimension-transformation operations. Each GIR additionally carries a parallelism n, indicating that the operation described by the GIR is executed n times in parallel.
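The following Python sketch shows one way such a GIR could be represented in memory; all names, shapes, and the parallelism value are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(eq=False)  # identity hashing, so slices and nodes can be set members
class DataSlice:
    """A GIR edge: one piece of data resident on one memory level."""
    name: str          # e.g. "D0"
    shape: tuple       # e.g. (P, Q)
    mem_level: str     # e.g. "DRAM", "shared", "register"

@dataclass(eq=False)
class GIRNode:
    """A GIR node: one hardware instruction or a group of instructions."""
    kind: str                                  # "move" | "compute" | "sync" | "reshape"
    op: str = ""                               # e.g. "relu" for compute nodes
    inputs: List[DataSlice] = field(default_factory=list)
    outputs: List[DataSlice] = field(default_factory=list)

@dataclass(eq=False)
class GIR:
    nodes: List[GIRNode]
    parallelism: int = 1   # the described operation runs n times in parallel

# In the spirit of FIG. 4: move a P x Q slice from DRAM into registers,
# apply ReLU, and move the result back (sizes and levels are made up here).
d0 = DataSlice("D0", (4, 8), "DRAM")
r0 = DataSlice("R0", (4, 8), "register")
r1 = DataSlice("R1", (4, 8), "register")
d1 = DataSlice("D1", (4, 8), "DRAM")
relu_gir = GIR(nodes=[
    GIRNode("move", inputs=[d0], outputs=[r0]),
    GIRNode("compute", op="relu", inputs=[r0], outputs=[r1]),
    GIRNode("move", inputs=[r1], outputs=[d1]),
], parallelism=128)
```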
Such a GIR description brings the following advantages to the present application. First, data-movement operations are represented explicitly, so optimizations on the GIR directly affect the memory performance of the program, which facilitates the subsequent search and optimization. Second, with the graph-based intermediate representation, the dependencies between data slices are represented clearly; on this basis, the scope of each data slice can easily be analyzed to decide on which level of the memory hierarchy it should be stored, improving program performance. Third, instruction-level description is finer-grained than operator-level description; the graph level and the operator level can be optimized jointly to generate efficient tensor-program code, covering a larger search space in which more refined implementations can be found.
The following describes in more detail how the memory-bound operators are optimized on the basis of the GIR. Fig. 5 is a schematic flow chart of optimizing memory-bound operators using the GIR according to an embodiment of the present application; the flow includes the following sub-steps:
step S1041: all access performance limited operators are formed into a plurality of subgraphs without external dependency relations.
In this embodiment, if a group of memory-bound operators only depend on one another and have no external dependencies, they may be placed in the same subgraph. The subsequent operations are all performed per subgraph; the codes of all subgraphs are then combined to obtain the optimized code of all memory-bound operators in the computational graph, i.e., the second code.
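A sketch of this partitioning as a connected-components pass over the memory-bound operators, reusing the Operator adjacency lists from the earlier sketch (our own construction, not necessarily the patent's algorithm):

```python
from collections import deque

def memory_bound_subgraphs(memory_bound_ops):
    """Group memory-bound operators into connected subgraphs.

    Two memory-bound operators land in the same subgraph when a chain of
    data-flow edges connects them without passing through a compute-bound
    operator, so each subgraph has no external dependency on another one.
    """
    pool = set(memory_bound_ops)
    seen, subgraphs = set(), []
    for op in memory_bound_ops:
        if op in seen:
            continue
        component, queue = [], deque([op])
        seen.add(op)
        while queue:
            cur = queue.popleft()
            component.append(cur)
            for neighbour in list(cur.inputs) + list(cur.outputs):
                if neighbour in pool and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        subgraphs.append(component)
    return subgraphs
```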
Step S1042: for each operator within each subgraph, all GIRs that can implement the operator are acquired.
In this embodiment, all possible GIRs may be generated for each operator in the subgraph according to its computation pattern; a Transpose operator, for example, can generate different implementations with different parameter sizes. Each operator thus corresponds to several GIRs.
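A sketch of such candidate enumeration for a hypothetical Transpose operator, reusing the GIR classes from the earlier sketch; the tile sizes and staging levels are invented for illustration:

```python
def build_transpose_gir(op_name, tile, staging):
    """Assemble one tiled transpose GIR (purely illustrative)."""
    src = DataSlice(f"{op_name}_in", (tile, tile), "DRAM")
    buf = DataSlice(f"{op_name}_buf", (tile, tile), staging)
    buf_t = DataSlice(f"{op_name}_bufT", (tile, tile), staging)
    out = DataSlice(f"{op_name}_out", (tile, tile), "DRAM")
    return GIR(nodes=[
        GIRNode("move", inputs=[src], outputs=[buf]),
        GIRNode("reshape", op="transpose", inputs=[buf], outputs=[buf_t]),
        GIRNode("move", inputs=[buf_t], outputs=[out]),
    ], parallelism=tile)

def candidate_girs_for_transpose(op_name, tile_sizes=(8, 16, 32)):
    """Enumerate candidate GIRs for one Transpose operator: each candidate
    uses a different tile size and stages tiles through a different
    on-chip memory level."""
    return [build_transpose_gir(op_name, tile, staging)
            for tile in tile_sizes
            for staging in ("shared", "register")]
```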
Step S1043: combine the GIRs of all operators in each subgraph to obtain the GIR combinations corresponding to the subgraph.
Because each operator corresponds to several GIRs, the GIRs of the operators can be combined in many ways. For example, if a subgraph contains two operators, operator 1 with 3 GIRs and operator 2 with 2 GIRs, the subgraph yields 6 GIR combinations.
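This combination step amounts to a Cartesian product over the per-operator candidate lists; a minimal sketch:

```python
import itertools

def gir_combinations(candidates_per_op):
    """Yield every way of picking one GIR per operator in a subgraph.

    candidates_per_op maps operator name -> list of candidate GIRs; with
    3 candidates for operator 1 and 2 for operator 2 this yields exactly
    the 6 combinations mentioned above.
    """
    names = list(candidates_per_op)
    for choice in itertools.product(*(candidates_per_op[n] for n in names)):
        yield dict(zip(names, choice))
```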
Step S1044: perform graph rewriting on the GIR combinations based on preset rules to obtain optimized GIR combinations.
The graph rewriting in this step optimizes the GIR combinations obtained above so as to reduce, as far as possible, the memory traffic of the code corresponding to each combination and thereby improve memory performance. The graph rewriting of this embodiment is based on three preset rules, each permitting one kind of transformation of the GIR. We refer to them here as the first rule, the second rule, and the third rule, where:
the first rule is: when multiple data-movement operations access the same data slice, add a synchronization operation whose scope corresponds to the scope of the movement operations. For example, for two data-movement operations accessing the same slice in DRAM, a synchronization operation may be added whose scope, depending on the access pattern, may be a thread, a warp, or the entire device on a GPU. This rule guarantees that the scope of the data is narrowed, so the data can be kept at a higher (faster) level of the memory hierarchy.
The second rule is: replace a write-synchronize-read sequence with a single synchronization operation, and collapse redundant synchronize-read sequences into one. This transformation guarantees that the total memory traffic is reduced.
The third rule is: operations that do not affect memory performance are forcibly swapped with predecessor operations with which they commute. Operations that do not affect memory performance include, for example, pointwise, transpose, reduction, and broadcast operations; forcing them to be exchanged with commuting predecessors ensures that adjacent operations that can be eliminated will meet and be eliminated. Because it fixes the direction of the exchange, the rule is irreversible.
Having defined the three rules, the application designs a heuristic algorithm that applies them in turn until the program can no longer be optimized; the code of the optimization process is shown in fig. 6. Specifically, the GIR combination is optimized repeatedly in the order first rule, third rule, second rule, until none of the rules can optimize the combination any further, finally yielding the optimized GIR combination. The complexity of this optimization process is O(N+M), where N is the number of nodes and M the number of edges of the GIR, so the optimization of a GIR combination can be carried out quickly.
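A sketch of this fixpoint loop, with the three rules passed in as callables; the rule bodies themselves (inserting synchronizations, swapping commuting operations, collapsing write-synchronize-read sequences) are not reproduced here:

```python
def rewrite_to_fixpoint(gir, first_rule, third_rule, second_rule):
    """Apply the rules in the order first, third, second, repeating
    until a full pass changes nothing.

    Each rule is assumed to be a callable that mutates the GIR in place
    and returns True if it fired.
    """
    changed = True
    while changed:
        changed = False
        for rule in (first_rule, third_rule, second_rule):
            while rule(gir):   # keep re-firing one rule while it applies
                changed = True
    return gir

# Usage with trivial stub rules (real rules would inspect gir.nodes):
def never_fires(g):
    return False

rewrite_to_fixpoint(relu_gir, never_fires, never_fires, never_fires)
```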
Step S1045: select, with a performance model, several candidate GIR combinations from the optimized GIR combinations.
Since the number of GIR combinations obtained after step S1044 may be large, the application further applies the performance model to analyze the performance of the optimized GIR combinations and selects a predetermined number of best-performing combinations as the candidate GIR combinations for subsequent code generation.
Step S1046: generate the second code based on the several candidate GIR combinations of each subgraph.
This step covers code generation from the GIRs: the best GIR combination must be selected from the above candidate combinations to produce the second code.
Preferably, as shown in fig. 7, the present step may further include the following sub-steps:
step S10461: a kernel is generated for each intermediate representation of the graph in the candidate GIR combinations.
Step S10462: select, for each GIR, a topological order that minimizes the total size of data slices that are live at the same time. The selection criterion here is to minimize the total size of simultaneously live data slices, i.e., the graph is reordered after it has been rewritten.
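One plausible greedy realization of this selection is sketched below, reusing the GIR classes from the earlier sketch; the liveness bookkeeping is our own construction and need not match the embodiment exactly:

```python
def slice_size(s):
    n = 1
    for d in s.shape:
        n *= d
    return n   # element count as a proxy for bytes

def min_live_topo_order(gir):
    """Greedy topological order that tries to keep the total size of
    simultaneously live data slices small.

    A slice becomes live when its producer runs and dies once its last
    consumer has run. At each step we run the ready node that leaves
    the smallest live set behind.
    """
    producers = {}   # slice -> node that produces it
    pending = {}     # slice -> number of consumers that have not yet run
    for node in gir.nodes:
        for s in node.outputs:
            producers[s] = node
        for s in node.inputs:
            pending[s] = pending.get(s, 0) + 1

    done, order, live = set(), [], set()

    def ready(node):
        return node not in done and all(
            (s not in producers) or (producers[s] in done) for s in node.inputs)

    def live_size_after(node):
        lv = set(live) | set(node.outputs)
        for s in node.inputs:
            if pending[s] == 1:       # node is the last consumer of s
                lv.discard(s)
        return sum(slice_size(s) for s in lv)

    while len(order) < len(gir.nodes):
        best = min((n for n in gir.nodes if ready(n)), key=live_size_after)
        done.add(best)
        order.append(best)
        live |= set(best.outputs)
        for s in best.inputs:
            pending[s] -= 1
            if pending[s] == 0:
                live.discard(s)
    return order

# Example: order the three-node ReLU GIR from the earlier sketch.
print([n.kind for n in min_live_topo_order(relu_gir)])  # ['move', 'compute', 'move']
```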
Step S10463: generate the operation instructions in the kernel program from the nodes of each GIR in the chosen topological order.
Step S10464: map each GIR onto the specific hardware and lower the operation instructions to instructions of that hardware, thereby forming the code of each candidate GIR combination.
The operation instructions generated in step S10463 are generic, but the tensor program must run on a specific hardware platform, so each GIR needs to be mapped onto specific hardware. On an NVIDIA GPU, for example, the parallelism is mapped to the parallel granularity of warps; the block and device to which each warp belongs are analyzed in the warp-block-device pattern, and GPU code is generated. On the Cambricon MLU, where pipelined parallelism must be used inside the kernel program, each unit of parallelism is mapped onto one pipeline stage, and the scope of each data-movement operation is analyzed in a pipeline-core-device pattern. FIG. 8 is a schematic diagram of the scope-analysis method provided by the present application, here performed in the warp-block-device pattern. The figure depicts the synchronization operations required by different combinations of memory-access operations on an NVIDIA GPU: different gray levels represent different warps involved in the memory access, and different shapes represent different access patterns. The first row shows cases requiring synchronization within a warp, the second row within a block, and the third row across the device.
This scope-analysis approach abstracts the differences between devices into constraints on the search process, so the optimization process of the system needs no modification for different platforms. On this basis, the system directly generates an instruction or set of instructions on the particular device for each compute operation in a GIR. When the system is migrated to a new hardware platform, only the translation of the different operation types into instructions has to be implemented, which requires very little code; for example, no more than 200 lines of code are needed to migrate the system to the NVIDIA GPU.
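A sketch of such a scope decision for one pair of accesses, under an assumed warp/block/device coordinate scheme (the coordinate dicts are our invention, not taken from the patent):

```python
def required_sync_scope(write_coord, read_coord):
    """Decide the synchronization scope for two accesses to one data
    slice, following the warp-block-device pattern described above.

    write_coord/read_coord are dicts with the parallel position of each
    access, e.g. {"device": 0, "block": 3, "warp": 1}.
    """
    if write_coord["device"] != read_coord["device"]:
        return "device"   # accesses on different devices
    if write_coord["block"] != read_coord["block"]:
        return "device"   # different blocks need a grid/device-wide sync
    if write_coord["warp"] != read_coord["warp"]:
        return "block"    # different warps in one block: block-level sync
    return "warp"         # same warp: warp-level sync (or none) suffices

print(required_sync_scope({"device": 0, "block": 0, "warp": 0},
                          {"device": 0, "block": 0, "warp": 1}))  # block
```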
Step S10465: performance-test the code of the candidate GIR combinations of each subgraph and select the best code of each subgraph as the second code.
After the codes of the candidate GIR combinations are obtained, each code is performance-tested; the best code of each subgraph is selected as the second code and combined with the first code of the compute-bound operators to produce the program of the end-to-end model. This program is written in the native language of the particular hardware and can be compiled and run directly.
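A sketch of this measure-and-select step, with compile_fn and run_fn as placeholder hooks around the target toolchain and runtime (e.g. a CUDA compiler and launcher); nothing here is an API from the patent:

```python
import time

def pick_best_code(candidate_codes, compile_fn, run_fn, repeats=20):
    """Compile and time each candidate's code on the real device and
    keep the fastest one as the subgraph's second code."""
    best, best_time = None, float("inf")
    for code in candidate_codes:
        binary = compile_fn(code)
        run_fn(binary)                       # warm-up run
        start = time.perf_counter()
        for _ in range(repeats):
            run_fn(binary)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best, best_time = code, elapsed
    return best, best_time
```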
Step S105: merge the first code and the second code to generate the optimized tensor program.
As can be seen from the above technical solutions, the tensor program optimization method provided by the present application optimizes memory performance more effectively than the prior art. Because edges in the graph intermediate representation represent data slices on a specified level of the memory hierarchy and nodes represent one or a group of hardware instructions performing a specified operation, this instruction-level graph intermediate representation with explicit data movement also brings the following beneficial effects:
1. Data-movement operations are represented explicitly, so optimizations on the graph intermediate representation directly affect the memory performance of the program, which facilitates the subsequent search and optimization.
2. With the graph-based intermediate representation, dependencies between data slices are represented clearly. On this basis, the scope of each data slice can easily be analyzed to decide on which level of the memory hierarchy it should be stored, thereby improving program performance.
3. Instruction-level description is finer-grained than operator-level description. The graph level and the operator level can be optimized jointly to generate efficient tensor-program code, covering a larger search space in which more refined implementations can be found.
The above steps are further described below by way of several specific examples.
Fig. 9 is a schematic diagram of the process from a computational graph to GIRs according to an embodiment of the present application. As shown in fig. 9, the activation function SiLU consists of two operators, Sigmoid and Mul. Several candidate GIRs are generated for each of the two operators, each combination is searched, and the operators are then merged to produce several candidate GIR combinations. Fig. 9 illustrates the case where one GIR is selected for each of Sigmoid and Mul; two synchronization operations are then added between the two data-movement operations by the first rule, giving the merged GIR combination.
FIG. 10 is a schematic diagram of a GIR fragment showing the connections among common GIR operations. The original GIR cannot be optimized further because of the pointwise operation (the blank circle). After the third rule swaps the operations, however, a synchronization operation is added between DRAM Read B and DRAM Write B by the first rule; DRAM Read B, DRAM Write B, and the intermediate synchronization can then be converted into a single synchronization by the second rule, and DRAM Write A, DRAM Read A, and their intermediate synchronization can likewise be converted into a synchronization by the second rule. What remains is the pointwise operation plus a synchronization, and once the synchronization is removed, only the optimized pointwise operation is left. This example shows that the GIR optimization process can effectively reduce the memory traffic of the whole program.
FIG. 11 shows the optimization process performed by the system of this embodiment on the attention module of GPT-2 during generation. For the two matrix multiplication (MatMul) operations in the computational graph, the performance model determines that they are memory-bound and generates several candidate GIRs for both. In the best-performing GIR combination, the two matrix multiplications use different GIR implementations, the specific difference being the dimensions of the bcast and reduce operations. After optimizing the GIR combination, the system can generate the entire attention operator as one kernel. On the NVIDIA GPU, the performance improvement in this example reaches 1.98x compared with TensorRT, TVM, and other prior work.
Fig. 12 compares the optimization effect of the present application with that of the prior art. Three platforms are used in this embodiment: an NVIDIA Tesla A100 GPU, an AMD MI100 GPU, and a Cambricon MLU-370. The first row in fig. 12 is the end-to-end performance comparison on the NVIDIA Tesla A100 GPU, the second row on the AMD MI100 GPU, and the third row on the Cambricon MLU-370.
Seven real DNN models were used in the experiments: BERT, ViT, GPT-2, SAR-DRN, EfficientNet, ShuffleNet, and RedNet-50. BERT, ViT, and GPT-2 are Transformer-based and are applied to natural language processing, image recognition, and similar tasks; SAR-DRN, EfficientNet, ShuffleNet, and RedNet-50 are CNN-based and target image tasks such as image classification and super-resolution.
The prior-art systems compared against the present application are A: PyTorch, B: TorchScript, C: TensorFlow, D: TF-XLA, E: TensorRT, F: TVM (Ansor), and G: MagicMind, while the present application is labeled H in the figures.
As can be seen from fig. 12, in the end-to-end experiments on the NVIDIA A100 GPU, the present application achieves a speedup of up to 1.98x over TensorRT, the best-performing existing system. Moreover, across all platforms in fig. 12, the speedup of the present application ranks among the best. The application is therefore clearly better than the prior art at memory-performance optimization.
Fig. 13 is a schematic structural diagram of a tensor program optimization device according to an embodiment of the present application. The device includes: a computational-graph acquisition unit 100, an operator classification unit 200, a compute-performance optimization unit 300, a memory-access-performance optimization unit 400, and a code merging unit 500. The operator classification unit 200 is connected to the computational-graph acquisition unit 100, the compute-performance optimization unit 300, and the memory-access-performance optimization unit 400; the code merging unit 500 is connected to the compute-performance optimization unit 300 and the memory-access-performance optimization unit 400.
The computational-graph acquisition unit 100 is configured to acquire the computational graph corresponding to the tensor program to be optimized.
The operator classification unit 200 is configured to obtain the compute-bound operators and the memory-bound operators in the computational graph by using a performance model.
The compute-performance optimization unit 300 is configured to invoke a hardware operator library to implement the compute-bound operators and generate the first code.
The memory-access-performance optimization unit 400 is configured to optimize the memory-bound operators by using a graph intermediate representation, where edges in the graph intermediate representation represent data slices on a specified level of the memory hierarchy and nodes represent one or a set of hardware instructions for performing specified operations.
The code merging unit 500 is configured to merge the first code and the second code to generate the optimized tensor program.
Preferably, as shown in fig. 14, the operator classification unit 200 in this embodiment may include a performance computation module 210 and an operator classification module 220, which are interconnected.
The performance computation module 210 is configured to predict the computation time and the memory-access time of each operator in the computational graph with the performance model.
The operator classification module 220 is configured to determine, based on the computation time and the memory-access time, whether each operator is compute-bound or memory-bound.
Preferably, the above specified operations include: a data-movement operation, a compute operation, a synchronization operation, or a dimension-transformation operation.
Preferably, as shown in fig. 15, the memory-access-performance optimization unit 400 includes: a subgraph acquisition module 410, an intermediate-representation acquisition module 420, a combination module 430, a graph rewriting module 440, a candidate-combination selection module 450, and a second code generation module 460, connected in sequence.
The subgraph acquisition module 410 is configured to form all memory-bound operators into several subgraphs without external dependencies.
The intermediate-representation acquisition module 420 is configured to acquire, for each operator in each subgraph, all graph intermediate representations that can implement the operator.
The combination module 430 is configured to combine the graph intermediate representations of all operators in each subgraph to obtain the corresponding graph-intermediate-representation combinations of the subgraph.
The graph rewriting module 440 is configured to rewrite the graph-intermediate-representation combinations based on the preset rules to obtain optimized combinations.
The candidate-combination selection module 450 is configured to select, with a performance model, several candidate combinations from the optimized combinations.
The second code generation module 460 is configured to generate the second code based on the several candidate combinations of each subgraph.
Preferably, the preset rules include a first rule, a second rule, and a third rule, where:
the first rule is: when multiple data-movement operations access the same data slice, add a synchronization operation whose scope corresponds to the scope of the movement operations;
the second rule is: replace a write-synchronize-read sequence with a single synchronization operation, and collapse redundant synchronize-read sequences into one;
the third rule is: operations that do not affect memory performance are forcibly swapped with predecessor operations with which they commute.
Preferably, the graph rewriting module 440 is specifically configured to apply the first rule, the third rule, and the second rule in turn, repeatedly, until none of the rules can optimize the combination any further, thereby obtaining the optimized graph-intermediate-representation combination.
Preferably, as shown in fig. 16, the second code generation module 460 may further include: a kernel generation sub-module 461, a topological-order selection sub-module 462, an operation-instruction generation sub-module 463, a mapping sub-module 464, and a performance-test sub-module 465, connected in sequence.
The kernel generation sub-module 461 is configured to generate a kernel program for each graph intermediate representation in the candidate combinations.
The topological-order selection sub-module 462 is configured to select, for each graph intermediate representation, a topological order that minimizes the total size of data slices that are live at the same time.
The operation-instruction generation sub-module 463 is configured to generate the operation instructions in the kernel program from the nodes of each graph intermediate representation in that topological order.
The mapping sub-module 464 is configured to map each graph intermediate representation onto the specific hardware and lower the operation instructions to instructions of that hardware, forming the code of each candidate combination.
The performance-test sub-module 465 is configured to performance-test the code of the candidate combinations of each subgraph and select the best code of each subgraph as the second code.
The detailed description of each unit may be found in the corresponding description of the foregoing method embodiments and is not repeated here.
As can be seen from the above technical solutions, the tensor program optimization device provided by the present application optimizes memory performance more effectively than the prior art. Because edges in the graph intermediate representation represent data slices on a specified level of the memory hierarchy and nodes represent one or a group of hardware instructions performing a specified operation, this instruction-level graph intermediate representation with explicit data movement also brings the following beneficial effects:
1. Data-movement operations are represented explicitly, so optimizations on the graph intermediate representation directly affect the memory performance of the program, which facilitates the subsequent search and optimization.
2. With the graph-based intermediate representation, dependencies between data slices are represented clearly. On this basis, the scope of each data slice can easily be analyzed to decide on which level of the memory hierarchy it should be stored, thereby improving program performance.
3. Instruction-level description is finer-grained than operator-level description. The graph level and the operator level can be optimized jointly to generate efficient tensor-program code, covering a larger search space in which more refined implementations can be found.
The embodiment of the invention also provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the above method when executing the program.
The embodiment of the invention also provides a computer-readable storage medium storing a computer program for executing the above method.
As shown in fig. 17, the electronic device 600 may further include: a communication module 110, an input unit 120, an audio processor 130, a display 160, and a power supply 170. It should be noted that the electronic device 600 need not include all of the components shown in fig. 17; in addition, the electronic device 600 may further include components not shown in fig. 17, for which reference may be made to the related art.
As shown in fig. 17, the central processor 100, sometimes also referred to as a controller or operation control, may include a microprocessor or other processor device and/or logic device; the central processor 100 receives inputs and controls the operation of the various components of the electronic device 600.
The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, removable media, a volatile memory, a non-volatile memory, or other suitable device. It may store the above-mentioned information as well as a program for executing the related method, and the central processor 100 can execute that program stored in the memory 140 to realize information storage or processing.
The input unit 120 provides an input to the central processor 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 is used to provide power to the electronic device 600. The display 160 is used for displaying display objects such as images and characters. The display may be, for example, but not limited to, an LCD display.
The memory 140 may be a solid-state memory, such as a read-only memory (ROM), a random access memory (RAM), a SIM card, or the like; it may also be a memory that holds information even when powered down, that can be selectively erased, and that can be provided with further data, an example of which is sometimes called an EPROM. The memory 140 may also be some other type of device. The memory 140 includes a buffer memory 141 (sometimes referred to as a buffer). The memory 140 may include an application/function storage 142 for storing application programs and function programs, or the flow for executing the operations of the electronic device 600 by the central processor 100.
The memory 140 may also include a data store 143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. A driver storage 144 of the memory 140 may include various drivers of the electronic device for the communication function and/or for performing other functions of the electronic device (e.g., a messaging application, an address-book application, etc.).
The communication module 110 is a transmitter/receiver that transmits and receives signals via an antenna 111. The communication module (transmitter/receiver) 110 is coupled to the central processor 100 to provide input signals and receive output signals, as in a conventional mobile communication terminal.
Based on different communication technologies, multiple communication modules 110 may be provided in the same electronic device, such as a cellular network module, a Bluetooth module, and/or a wireless local area network module. The communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via the audio processor 130 to provide audio output via the speaker 131 and receive audio input from the microphone 132, thereby implementing the usual telecommunication functions. The audio processor 130 may include any suitable buffers, decoders, amplifiers, and so forth. In addition, the audio processor 130 is coupled to the central processor 100, so that sound can be recorded locally through the microphone 132 and sound stored locally can be played through the speaker 131.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principles and embodiments of the present invention have been described herein with reference to specific examples, which are provided only to facilitate understanding of the method and core ideas of the present invention; meanwhile, since those skilled in the art may vary the specific embodiments and the scope of application in accordance with the ideas of the present invention, the contents of this description should not be construed as limiting the present invention.

Claims (16)

1. A method of tensor program optimization, the method comprising:
acquiring a computational graph corresponding to a tensor program to be optimized;
obtaining a computing performance limited operator and a memory access performance limited operator in the computational graph by using a performance model;
invoking a hardware operator library to compute the computing performance limited operator, so as to generate first code;
optimizing the memory access performance limited operator by using a graph intermediate representation to generate second code, wherein edges in the graph intermediate representation represent data pieces on a specified memory hierarchy, and nodes in the graph intermediate representation represent one or a group of hardware instructions for executing specified operations;
and merging the first code and the second code to generate an optimized tensor program.
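For readers unfamiliar with the graph intermediate representation of claim 1, the following Python sketch shows one possible encoding in which each edge is a data piece bound to a level of the memory hierarchy and each node stands for one or a group of hardware instructions. This is an illustrative assumption, not the claimed implementation; all class, field, and level names are hypothetical.

```python
# Illustrative (hypothetical) encoding of the graph intermediate
# representation of claim 1: edges are data pieces bound to a memory
# hierarchy level; nodes are one or a group of hardware instructions.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple

class MemLevel(Enum):
    REGISTER = 0   # fastest, smallest
    SHARED = 1     # on-chip scratchpad / shared memory
    GLOBAL = 2     # off-chip device memory

@dataclass
class DataPiece:               # an edge of the graph IR
    name: str
    shape: Tuple[int, ...]
    level: MemLevel

@dataclass
class OpNode:                  # a node of the graph IR
    kind: str                  # "move" | "compute" | "sync" | "reshape"
    inputs: List[DataPiece] = field(default_factory=list)
    outputs: List[DataPiece] = field(default_factory=list)

@dataclass
class GraphIR:
    nodes: List[OpNode] = field(default_factory=list)

# Example: stage a tile into shared memory, compute on it, synchronize.
a_glb = DataPiece("A", (128, 128), MemLevel.GLOBAL)
a_shm = DataPiece("A_tile", (32, 32), MemLevel.SHARED)
b_shm = DataPiece("B_tile", (32, 32), MemLevel.SHARED)
ir = GraphIR([
    OpNode("move", [a_glb], [a_shm]),
    OpNode("compute", [a_shm], [b_shm]),
    OpNode("sync"),
])
```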
2. The tensor program optimization method of claim 1, wherein said obtaining a computing performance limited operator and a memory access performance limited operator in the computational graph using a performance model comprises:
predicting the computation time and the memory access time of each operator in the computational graph by using the performance model;
and determining, based on the computation time and the memory access time, whether each operator is a computing performance limited operator or a memory access performance limited operator.
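The classification step of claim 2 can be pictured with a simple roofline-style performance model, in which computation time is estimated from an operator's arithmetic work and the device's peak throughput, and memory access time from its data traffic and peak bandwidth. The sketch below is a minimal illustration under that assumption; the constants, function names, and the roofline form itself are hypothetical rather than taken from the disclosure.

```python
# Hypothetical roofline-style model for the classification in claim 2.
# The peak numbers are illustrative, not taken from the disclosure.
PEAK_FLOPS = 19.5e12   # peak arithmetic throughput, FLOP/s
PEAK_BW = 1.6e12       # peak memory bandwidth, bytes/s

def classify(op_flops: float, op_bytes: float) -> str:
    """Predict which resource bounds the operator under the model."""
    t_compute = op_flops / PEAK_FLOPS   # predicted computation time
    t_memory = op_bytes / PEAK_BW       # predicted memory access time
    return "compute-limited" if t_compute >= t_memory else "memory-limited"

# A large matmul is typically compute-limited; an elementwise add
# (reads two fp32 tensors, writes one) is memory-limited:
print(classify(2 * 4096**3, 3 * 4096**2 * 4))  # compute-limited
print(classify(4096**2, 3 * 4096**2 * 4))      # memory-limited
```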
3. The tensor program optimization method of claim 1, wherein the specified operations comprise: a data movement operation, a computation operation, a synchronization operation, or a dimension transformation operation.
4. The tensor program optimization method of claim 1, wherein said optimizing the memory access performance limited operator using a graph intermediate representation to generate second code comprises:
forming, from all memory access performance limited operators, a plurality of subgraphs having no external dependency relations;
for each operator in each subgraph, acquiring all graph intermediate representations capable of implementing the operator;
combining the graph intermediate representations of all operators in each subgraph to obtain the graph intermediate representation combinations corresponding to the subgraph;
performing graph rewriting on the graph intermediate representation combinations based on preset rules to obtain optimized graph intermediate representation combinations;
selecting, by using the performance model, several alternative graph intermediate representation combinations meeting requirements from the optimized graph intermediate representation combinations;
and generating the second code based on the several alternative graph intermediate representation combinations of each subgraph.
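One way to picture the enumeration and combination steps of claim 4 is as a per-operator candidate set whose cross product yields the subgraph's graph intermediate representation combinations. The Python sketch below illustrates this reading; candidate_irs is a hypothetical placeholder for the real enumeration of legal graph intermediate representations.

```python
# Hypothetical sketch of the enumeration in claim 4: each operator of a
# subgraph may be implemented by several graph IRs (here, by placing its
# tiles at different memory levels); the subgraph's candidate set is the
# cross product of the per-operator candidates.
from itertools import product

def candidate_irs(op: str) -> list:
    # Placeholder: a real system would enumerate all legal graph IRs.
    return [f"{op}@shared", f"{op}@register"]

def subgraph_combinations(subgraph_ops: list) -> list:
    per_op = [candidate_irs(op) for op in subgraph_ops]
    return [list(combo) for combo in product(*per_op)]

print(subgraph_combinations(["add", "relu"]))
# -> 4 combinations, e.g. ['add@shared', 'relu@register'], ...
```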
5. The tensor program optimization method of claim 4, wherein the preset rules comprise a first rule, a second rule, and a third rule, wherein:
the first rule is: when a plurality of data movement operations access the same data piece, adding a synchronization operation corresponding to the scope of the data movement operations;
the second rule is: replacing a write-synchronize-read operation sequence with a synchronization operation, and replacing a synchronize-read operation sequence with a read operation;
the third rule is: forcibly exchanging an operation that does not affect memory access performance with its predecessor operation when the two operations are exchangeable.
6. The tensor program optimization method of claim 5, wherein said performing graph rewriting on the graph intermediate representation combinations based on preset rules to obtain optimized graph intermediate representation combinations comprises: optimizing the graph intermediate representation combinations repeatedly in the order of the first rule, the third rule, and the second rule, until none of the rules can further optimize the graph intermediate representation combinations, so as to obtain the optimized graph intermediate representation combinations.
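Claims 5 and 6 together describe a rewriting system applied to a fixpoint. The sketch below shows a minimal driver loop that tries the rules in the claimed order (first, third, second) until none applies; the three rule functions are empty placeholders whose names and signatures are assumptions for illustration.

```python
# Minimal fixpoint driver for the rewriting of claims 5 and 6. The three
# rule functions are empty placeholders (names and signatures assumed):
# each returns a rewritten combination when it applies, else None.
def rule_add_sync(ir):         # first rule: insert scope-matching syncs
    return None

def rule_force_swap(ir):       # third rule: exchange ops that do not
    return None                # affect memory performance with their
                               # exchangeable predecessors

def rule_elide_roundtrip(ir):  # second rule: remove write-sync-read
    return None                # round trips

def rewrite_to_fixpoint(ir):
    # Claim 6: apply first, third, second, repeatedly, until no rule
    # can rewrite the combination any further.
    rules = [rule_add_sync, rule_force_swap, rule_elide_roundtrip]
    changed = True
    while changed:
        changed = False
        for rule in rules:
            out = rule(ir)
            if out is not None:
                ir, changed = out, True
    return ir
```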
7. The tensor program optimization method of claim 4, wherein said generating the second code based on the several alternative graph intermediate representation combinations of each subgraph comprises:
generating a kernel program for each graph intermediate representation in the alternative graph intermediate representation combinations;
selecting, for each graph intermediate representation, a topological order that minimizes the total amount of data pieces that are live at the same time;
generating, according to the topological order, operation instructions in the kernel program for the nodes in each graph intermediate representation;
mapping each graph intermediate representation to specific hardware and generating operation instructions on the specific hardware based on the foregoing operation instructions, so as to form the code of each alternative graph intermediate representation combination;
and performing performance tests on the code of the alternative graph intermediate representation combinations of each subgraph, and selecting the best code of each subgraph as the second code.
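The ordering step of claim 7 amounts to choosing, among the valid topological orders of a graph intermediate representation, one that keeps the total size of simultaneously live data pieces small. The sketch below shows a greedy heuristic under that reading; the function name, inputs, and the greedy strategy itself are illustrative assumptions, as the disclosure does not commit to a particular search method here.

```python
# Hypothetical greedy reading of claim 7's ordering step: pick, at each
# point, the ready node whose emission keeps live bytes smallest. The
# function name, inputs, and greedy strategy are assumptions.
def min_live_topo_order(nodes, deps, live_delta):
    """nodes: node ids; deps: {node: set of prerequisite nodes};
    live_delta: {node: change in live bytes when the node is emitted}."""
    order, done, live = [], set(), 0
    while len(order) < len(nodes):
        ready = [n for n in nodes if n not in done and deps[n] <= done]
        nxt = min(ready, key=lambda n: live + live_delta[n])
        order.append(nxt)
        done.add(nxt)
        live += live_delta[nxt]
    return order

# Tiny example: freeing A's tile before loading B's halves peak liveness.
nodes = ["loadA", "useA", "loadB", "useB"]
deps = {"loadA": set(), "useA": {"loadA"}, "loadB": set(), "useB": {"loadB"}}
delta = {"loadA": 4096, "useA": -4096, "loadB": 4096, "useB": -4096}
print(min_live_topo_order(nodes, deps, delta))
# ['loadA', 'useA', 'loadB', 'useB'], peak liveness 4096 rather than 8192
```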
8. A tensor program optimization device, the device comprising:
the computational graph acquisition unit, used for acquiring a computational graph corresponding to a tensor program to be optimized;
the operator classification unit, used for obtaining a computing performance limited operator and a memory access performance limited operator in the computational graph by using a performance model;
the computing performance optimization unit, used for invoking a hardware operator library to compute the computing performance limited operator so as to generate first code;
the memory access performance optimization unit, used for optimizing the memory access performance limited operator by using a graph intermediate representation to generate second code, wherein edges in the graph intermediate representation represent data pieces on a specified memory hierarchy, and nodes in the graph intermediate representation represent one or a group of hardware instructions for executing specified operations;
and the code merging unit, used for merging the first code and the second code to generate an optimized tensor program.
9. The tensor program optimization device of claim 8, wherein the operator classification unit comprises:
the performance calculation module, used for predicting the computation time and the memory access time of each operator in the computational graph by using the performance model;
and the operator classification module, used for determining, based on the computation time and the memory access time, whether each operator is a computing performance limited operator or a memory access performance limited operator.
10. The tensor program optimization device of claim 8, wherein the specified operations comprise: a data movement operation, a computation operation, a synchronization operation, or a dimension transformation operation.
11. The tensor program optimization device of claim 8, wherein the memory access performance optimization unit comprises:
the subgraph acquisition module, used for forming, from all memory access performance limited operators, a plurality of subgraphs having no external dependency relations;
the graph intermediate representation acquisition module, used for acquiring, for each operator in each subgraph, all graph intermediate representations capable of implementing the operator;
the combination module, used for combining the graph intermediate representations of all operators in each subgraph to obtain the graph intermediate representation combinations corresponding to the subgraph;
the graph rewriting module, used for performing graph rewriting on the graph intermediate representation combinations based on preset rules to obtain optimized graph intermediate representation combinations;
the alternative combination selection module, used for selecting, by using the performance model, several alternative graph intermediate representation combinations meeting requirements from the optimized graph intermediate representation combinations;
and the second code generation module, used for generating the second code based on the several alternative graph intermediate representation combinations of each subgraph.
12. The tensor program optimization device of claim 11, wherein the preset rules comprise a first rule, a second rule, and a third rule, wherein:
the first rule is: when a plurality of data movement operations access the same data piece, adding a synchronization operation corresponding to the scope of the data movement operations;
the second rule is: replacing a write-synchronize-read operation sequence with a synchronization operation, and replacing a synchronize-read operation sequence with a read operation;
the third rule is: forcibly exchanging an operation that does not affect memory access performance with its predecessor operation when the two operations are exchangeable.
13. The tensor program optimization device of claim 12, wherein the graph rewriting module is specifically configured to: optimize the graph intermediate representation combinations repeatedly in the order of the first rule, the third rule, and the second rule, until none of the rules can further optimize the graph intermediate representation combinations, so as to obtain the optimized graph intermediate representation combinations.
14. The tensor program optimization device of claim 11, wherein the second code generation module comprises:
a kernel generation sub-module, used for generating a kernel program for each graph intermediate representation in the alternative graph intermediate representation combinations;
a topological order selection sub-module, used for selecting, for each graph intermediate representation, a topological order that minimizes the total amount of data pieces that are live at the same time;
an operation instruction generation sub-module, used for generating, according to the topological order, operation instructions in the kernel program for the nodes in each graph intermediate representation;
a mapping sub-module, used for mapping each graph intermediate representation to specific hardware and generating operation instructions on the specific hardware based on the foregoing operation instructions, so as to form the code of each alternative graph intermediate representation combination;
and a performance test sub-module, used for performing performance tests on the code of the alternative graph intermediate representation combinations of each subgraph and selecting the best code of each subgraph as the second code.
15. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
16. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202310827561.8A 2023-07-06 2023-07-06 Tensor program optimization method and device Pending CN117008916A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310827561.8A CN117008916A (en) 2023-07-06 2023-07-06 Tensor program optimization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310827561.8A CN117008916A (en) 2023-07-06 2023-07-06 Tensor program optimization method and device

Publications (1)

Publication Number Publication Date
CN117008916A true CN117008916A (en) 2023-11-07

Family

ID=88561101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310827561.8A Pending CN117008916A (en) 2023-07-06 2023-07-06 Tensor program optimization method and device

Country Status (1)

Country Link
CN (1) CN117008916A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10409560B1 (en) * 2015-11-18 2019-09-10 Amazon Technologies, Inc. Acceleration techniques for graph analysis programs
CN110764744A (en) * 2018-07-25 2020-02-07 赛灵思公司 Intermediate representation generation method and device for neural network computation
CN113254867A (en) * 2021-06-28 2021-08-13 中科弘云科技(北京)有限公司 Automatic configuration template generation method and device, server and storage medium
CN116204847A (en) * 2021-11-29 2023-06-02 华为技术有限公司 Calculation graph optimization method, device and equipment
CN114580653A (en) * 2022-01-12 2022-06-03 阿里云计算有限公司 Machine learning calculation optimization method and compiler

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
王朝闻: "Research and Implementation of Convolution Optimization and Computation Graph Partitioning Parallel Scheduling Methods Based on TVM", China Masters' Theses Full-text Database, 1 June 2022 (2022-06-01) *
郑祯; 翟季冬; 李焱; 陈文光: "Workload Characteristic Analysis of Typical GPU Programs Based on the CUPTI Interface", Journal of Computer Research and Development, no. 06, 15 June 2016 (2016-06-15) *
骆裕龙; 谭光明; 孙凝晖: "A Performance Optimization Framework for Software-Hardware Co-design", High Technology Letters, no. 10, 15 October 2014 (2014-10-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648091A (en) * 2023-12-12 2024-03-05 上海寒武纪信息科技有限公司 Compiling method of calculation graph and related product

Similar Documents

Publication Publication Date Title
US11341303B2 (en) System for reversible circuit compilation with space constraint, method and program
CN110659728B (en) Neural network optimization method, device, computer equipment and storage medium
US11023360B2 (en) Systems and methods for configuring programmable logic devices for deep learning networks
CN110689115B (en) Neural network model processing method and device, computer equipment and storage medium
US11604796B2 (en) Unified optimization of iterative analytical query processing
CN111626430A (en) Data processing method and related product
WO2021057746A1 (en) Neural network processing method and apparatus, computer device and storage medium
US20200301736A1 (en) A computer-implemented method, a computer-readable medium and a heterogeneous computing system
CN111401538A (en) Data processing method and device, computer equipment and storage medium
CN112148472A (en) Method and apparatus for improving utilization of heterogeneous system executing software
CN117008916A (en) Tensor program optimization method and device
CN111401539A (en) Data processing method and device, computer equipment and storage medium
Cecilia et al. Enhancing GPU parallelism in nature-inspired algorithms
CN113961267B (en) Service processing method, device and equipment
Wen et al. A swap dominated tensor re-generation strategy for training deep learning models
WO2021114757A1 (en) Optimization method and apparatus for computation graph, computer device, and storage medium
Hu et al. Hierarchical memory size estimation for loop fusion and loop shifting in data-dominated applications
CN115860061A (en) Graph neural network optimization method and graph neural network inference system
KR20230058621A (en) Memory-limit scheduling
CN116185377A (en) Optimization method and device for calculation graph and related product
Zong et al. STR: Hybrid Tensor Re-Generation to Break Memory Wall for DNN Training
Chang et al. Deep neural networks compiler for a trace-based accelerator
CN116755714B (en) Method, device, equipment and storage medium for operating deep neural network model
Yuan et al. Runtime shader simplification via instant search in reduced optimization space
Schmitt et al. A language extension set to generate adaptive versions automatically

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination