CN112579063A - Acceleration method for exploring optimization space in deep learning compiler - Google Patents

Acceleration method for exploring optimization space in deep learning compiler

Info

Publication number
CN112579063A
CN112579063A (application CN202110223874.3A)
Authority
CN
China
Prior art keywords
operator
optimization
space
graph
operators
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110223874.3A
Other languages
Chinese (zh)
Other versions
CN112579063B (en)
Inventor
潘秋红 (Pan Qiuhong)
何水兵 (He Shuibing)
陈刚 (Chen Gang)
杨弢 (Yang Tao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110223874.3A priority Critical patent/CN112579063B/en
Publication of CN112579063A publication Critical patent/CN112579063A/en
Application granted granted Critical
Publication of CN112579063B publication Critical patent/CN112579063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/30: Creation or generation of source code
    • G06F 8/37: Compiler construction; Parser generation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an acceleration method for exploring the optimization space in a deep learning compiler, with the aim of optimizing neural networks through compilation techniques while greatly reducing the time the compiler spends exploring the operator optimization space. The method first abstracts the neural network into the form of a computational graph. It then performs graph-level optimization on the computational graph and defines an optimization space for each operator in the optimized graph. Next, based on the operators and their optimization-space information, it provides a method for computing optimization-space similarity. Finally, it provides a similarity-based operator state-space exploration method: operators are clustered by similarity, the full space of the core operator in each cluster is explored, the remaining operators of the same cluster are explored within the core operator's best schemes, and the best scheme for every operator of the whole neural network is determined.

Description

Acceleration method for exploring optimization space in deep learning compiler
Technical Field
The invention relates to the intersection of deep learning, compilation technology, and high-performance computing, and in particular to an acceleration method for exploring the optimization space in a deep learning compiler.
Background
Today, Deep Neural Networks (DNNs) are widely applied in image classification, natural language processing, autonomous driving, augmented reality, and other AI fields. With the rapid development of computing devices such as GPUs, FPGAs, and specially designed neural network accelerators, the computing power available to DNNs keeps growing, and the demand for efficient DNNs in artificial intelligence grows with it, so improving the running efficiency of DNNs has become an important research problem in recent years.
There are now many deep learning frameworks, such as TensorFlow, PyTorch, Caffe, and MXNet, which represent neural networks as computational graphs, perform graph-level optimization on the graphs, and then map the operators of the DNN to third-party acceleration libraries such as cuDNN and MKL-DNN to run the DNN efficiently. However, graph-level optimization is generally hardware-independent and cannot exploit hardware characteristics for finer-grained optimization. Furthermore, the third-party acceleration libraries relied upon are generally closed-source, which prevents programmers from exercising effective control and from easily porting DNNs across hardware devices. In addition, operators not supported by a third-party library either cannot be optimized at all or require a great deal of manual tuning effort from the programmer.
In research on DNN acceleration, mapping neural networks from different frameworks onto various hardware platforms through compilation, accelerating the networks during the mapping, and generating optimized target-platform code has achieved remarkable results. A typical efficient neural network compiler has the following execution flow: the neural network from one of several deep learning frameworks is first expressed as a computational graph in a high-level intermediate language, and graph-level optimization is applied; the optimized computational graph is then lowered to a low-level intermediate language representation and optimized at the operator level; finally, optimized code is generated for the target hardware platform.
When the neural network is optimized at the operator level, the optimization space of each operator is defined in advance, and a machine learning method is then used to explore that space for the best optimization scheme. The optimization space of each operator is very large (a Conv operator, for example, may have hundreds of millions of candidate schemes), so exploring the operator optimization spaces is time-consuming; a Yolo network, for instance, needs more than a day to explore its optimization schemes.
Disclosure of Invention
To remedy the deficiencies of the prior art and greatly reduce the time the compiler spends exploring the operator optimization space, at the cost of an acceptable increase in deep learning network inference time, the invention adopts the following technical scheme:
An acceleration method for exploring the optimization space in a deep learning compiler, comprising the following steps:
s1, abstracting the neural network and representing the neural network in the form of a calculation graph;
s2, carrying out graph optimization on the calculation graph;
s3, defining an optimization space for each operator in the optimized calculation graph, and performing optimization space similarity calculation based on the operator containing optimization space information;
s4, searching operator state space based on similarity, clustering operators based on similarity, searching full space of core operators in each cluster, searching other operators of the same class in the optimal scheme of the core operators, determining the optimal scheme of each operator of the whole neural network, and generating target platform codes according to a hardware platform, comprising the following steps:
s41, calculating a similarity matrix of the operators;
s42, the similarity matrix is used as input to execute AP clustering, the AP clustering algorithm does not need to determine the clustering number in advance, the center of each cluster after clustering is an input operator, and an operator does not need to be selected for each cluster again to serve as a core;
s43, for each clustered core operator, searching the complete optimization space of the operator, and storing n optimization schemes with shortest inference time in the searching process;
s44, for each non-core operator of each cluster, only n optimal schemes searched by traversing the core operators depended on by the non-core operators are needed;
and S45, generating a target platform code for each operator according to the optimization scheme, and deploying the operator codes to hardware to run a neural network according to the sequence in the calculation diagram.
Therefore, the time consumption of the compiler for exploring an operator optimization space is greatly reduced under the sacrifice of the increase of the acceptable deep learning network reasoning time.
Further, the neural network computational graph representation of step S1 comprises the following steps:
S11, mapping the neural network constructed in a deep learning framework onto a well-defined high-level intermediate language (HIR);
and S12, analyzing the attributes of each operator based on the high-level intermediate language and constructing a computational graph from the data dependencies among the operators, wherein the constructed graph is a directed acyclic graph in which each node represents one operator of the neural network and each edge represents a data dependency between operators. The HIR is a domain-specific language (DSL) that can represent neural network computation and control flow; neural networks in TensorFlow, PyTorch, or ONNX format are mapped onto the HIR and represented by it.
Further, the graph optimization of step S2 comprises the following steps:
S21, performing operator fusion according to the computation type of each operator, wherein operator fusion combines several basic operators into one composite operator without storing intermediate results, thereby reducing unnecessary memory reads and writes and improving cache locality;
S22, performing data layout optimization on the computational graph after operator fusion according to the hardware characteristics;
and S23, merging parallel operators in the computational graph after data layout optimization.
Further, step S21 specifically comprises: first constructing a dominator tree, then traversing the nodes of the dominator tree, and, if the nodes on the path from a node to its dominator satisfy a predefined fusion rule, fusing those operators into a new composite operator.
Further, step S22 specifically comprises: for the computational graph after operator fusion, judging whether a data layout scheme was specified at graph input; if so, applying the specified scheme directly; if not, selecting the best data layout scheme according to the hardware characteristics, the layout schemes including row-major and column-major storage.
Further, step S23 specifically comprises: merging several operators that share the same input into one larger operator, which generates a larger kernel for the GPU, reduces the GPU kernel launch overhead, and improves GPU utilization.
Further, in step S3, the graph-optimized computational graph is mapped onto a low-level intermediate language (LIR), represented in LIR, and an optimization space is defined for each operator. The LIR is a finer-grained intermediate language form on which operator-level optimization and target-platform code generation can be performed.
Further, the operator undergoes multi-dimensional tiling and several loop-unrolling optimizations: when a dimension of the operator of original length l is tiled into m dimensions, there are k tiling schemes in total; by analogy, each optimization operation has some number k of candidate schemes, and the optimization space of the whole operator is the product of the scheme counts of all its optimization operations.
Further, in step S3, the optimization-space similarity is calculated by the following steps:
S31, defining a hash method for each optimization operation according to the operator's optimization-space attributes, abstracting the optimization operation into hash values;
S32, vectorizing each pair of operators: appending the hash values in sequence as vector values following the order of the optimization operations, making the two operators' vectors equal in length by zero padding, and concatenating the vector values of every optimization operation in the space in turn to form the vector value of the whole space;
and S33, computing the similarity of the pair of operators from the concatenated vector values.
Further, the similarity calculation takes the cosine of the pair of concatenated vectors as the similarity of the pair of operators; using the cosine value as the similarity measure gives the best classification result.
The invention has the advantages and beneficial effects that:
the neural network generated by various deep learning frames is mapped to a uniform intermediate language, codes of various hardware platforms can be generated, the expenditure of programmers caused by model conversion due to different development frames is saved, and the capability of deploying the neural network across hardware equipment is realized; the front end optimizes the neural network at a graph level, the rear end optimizes the neural network at an operator level, the whole optimization process is automatically carried out, efficient optimized codes can be generated for a hardware platform, and a programmer does not need to spend a large amount of time and energy to carry out manual optimization; and when the back end carries out operator optimization, an optimization space exploration scheme based on clustering is executed, so that the time consumption generated by exploring the optimization space can be greatly reduced.
Drawings
FIG. 1 is a flow chart of an acceleration method for exploring an optimization space in a deep learning compiler according to the present invention.
FIG. 2 is a schematic diagram of the computational graph of a Conv-BN-Relu module in the present invention.
FIG. 3 is a schematic diagram of a computational graph before operator fusion in the present invention.
FIG. 4 is a schematic diagram of the computational graph after operator fusion in the present invention.
FIG. 5 is a schematic diagram of the computational graph after parallel Conv operator merging in the present invention.
FIG. 6 is a schematic diagram of an operator optimization space in the present invention.
FIG. 7 is a schematic diagram of operator optimization-space vectorization in the present invention.
FIG. 8 is a flow chart of neural network operator-level optimization in the present invention.
Detailed Description
The following describes embodiments of the invention in detail with reference to the accompanying drawings. It should be understood that the detailed description and specific examples are given by way of illustration and explanation only and do not limit the invention.
As shown in FIG. 1, the acceleration method for exploring the optimization space in a deep learning compiler aims to greatly reduce the time the compiler spends exploring the operator optimization space, at the cost of an acceptable increase in deep learning network inference time. The method first abstracts the neural network into the form of a computational graph. It then performs graph-level optimization on the computational graph and defines an optimization space for each operator in the optimized graph. Next, based on the operators and their optimization-space information, it provides a method for computing optimization-space similarity. Finally, it provides a similarity-based operator state-space exploration method: operators are clustered by similarity, the full space of the core operator in each cluster is explored, the remaining operators of the same cluster are explored within the core operator's best schemes, the best scheme for every operator of the whole neural network is determined, and target-platform code is generated for the hardware platform.
The method comprises a front end and a back end. The front end takes a model generated by a deep learning framework as input, abstracts it into a computational graph expressed in a high-level intermediate language, and performs graph optimization. The back end takes the front-end-optimized computational graph as input, expresses it in a low-level intermediate language, performs operator optimization while accelerating the exploration process, and finally generates target-platform code for the hardware platform.
The specific implementation of the invention is as follows:
1) The model generated by the deep learning framework is represented as a computational graph.
1.1) The method implements a high-level intermediate language (HIR), a domain-specific language (DSL) that can represent neural network computation and control flow.
1.2) Neural networks in TensorFlow, PyTorch, or ONNX format are mapped onto the HIR and expressed in HIR.
1.3) A computational graph is constructed from the converted HIR. The graph is a directed acyclic graph in which each node represents one operator of the neural network and each edge represents a data dependency between operators. The computational graph captures the control flow and the dependencies between operators and data, and provides an interface for graph-level optimization. Fig. 2 shows the computational graph generated by a simple Conv-BN-Relu module of a neural network: each rounded rectangle represents an operator node (three in this example), and each edge represents a data dependency between operators; for example, the Conv operator depends on the input data and the weight data W1, and the BN operator depends on the result of the preceding Conv operator.
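For illustration only, the following minimal Python sketch (the OpNode class and all names are assumptions for this example, not the patent's HIR) builds the Fig. 2 graph as operator nodes whose input lists are the data-dependency edges:

    class OpNode:
        def __init__(self, name, op_type, inputs):
            self.name = name          # e.g. "conv1"
            self.op_type = op_type    # e.g. "Conv"
            self.inputs = inputs      # upstream nodes or named tensors (the edges)

    data, w1 = "input", "W1"                    # external tensors
    conv = OpNode("conv1", "Conv", [data, w1])  # Conv depends on input and weight W1
    bn   = OpNode("bn1",   "BN",   [conv])      # BN depends on the Conv result
    relu = OpNode("relu1", "Relu", [bn])        # Relu depends on the BN result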
2) The neural network is optimized at the graph level based on the computational graph.
2.1) Operator fusion optimization is applied to the computational graph. Operator fusion is an optimization technique that combines several basic operators into one composite operator, avoiding the storage of intermediate results, reducing unnecessary memory reads and writes, and improving cache locality. For a given computational graph, a dominator tree is constructed first; the nodes of the dominator tree are then traversed, and if the nodes on the path from a node to its dominator satisfy a predefined fusion rule, those operators are fused into a new composite operator. Fig. 3 shows a computational graph before operator fusion optimization. Its dominator tree is computed first; for example, the dominator of node 2 is node 1, and the dominator of node 3 is node 2. The dominator tree is then traversed and the rules are matched: node 2 satisfies the fusion condition with its dominator node 1, so they are fused into a new node 1; the dominator of node 3 then becomes the fused node 1, with which it satisfies a new fusion condition. On this basis, the Conv-BN-Relu triple formed by nodes 1, 2, and 3 is fused into a new node, denoted CBR; applying the rule to the whole computational graph yields the form shown in Fig. 4.
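A simplified sketch of such a fusion pass, reusing the OpNode class from the sketch above: the immediate dominators (idom) are assumed to be precomputed, and the two fusion rules stand in for the patent's predefined rules.

    FUSABLE = {("Conv", "BN"), ("CBR", "Relu")}   # assumed illustrative rules

    def fuse_pass(nodes, idom):
        for node in list(nodes):
            dom = idom.get(node)
            if dom is not None and (dom.op_type, node.op_type) in FUSABLE:
                dom.op_type = "CBR"               # the dominator absorbs the node
                dom.name += "+" + node.name       # e.g. "conv1+bn1+relu1"
                # nodes previously dominated by the absorbed node now point at dom
                idom.update({k: dom for k, v in idom.items() if v is node})
                nodes.remove(node)
        return nodes

    # For the Conv-BN-Relu chain of Fig. 3: fuse_pass([conv, bn, relu],
    # {bn: conv, relu: bn}) collapses the three nodes into one CBR node (Fig. 4).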
2.2) Data layout optimization is applied to the computational graph after operator fusion optimization. Each operator in the computational graph can be stored on the physical device in a number of ways. For the fused computational graph, it is first determined whether a data layout scheme was specified at graph input; if so, that scheme is applied directly. When no layout is specified, the best data layout scheme is selected according to the hardware characteristics, such as row-major or column-major storage. The most basic choice is whether the data should be stored in NHWC or NCHW format.
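As a toy illustration of the two candidate layouts (the tensor shape is an assumption for this example), the following Python snippet stores the same 4-D activation tensor in NCHW and NHWC order:

    import numpy as np

    x_nchw = np.random.rand(1, 32, 56, 56)   # batch N, channels C, height H, width W
    x_nhwc = x_nchw.transpose(0, 2, 3, 1)    # reorder axes to N, H, W, C
    assert x_nhwc.shape == (1, 56, 56, 32)   # same data, different memory layout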
2.3) Parallel Conv operator merging is applied to the computational graph after data layout optimization. If the graph contains several Conv operators that share the same input, they are merged into one larger Conv operator; this generates a larger kernel for the GPU, reduces the GPU kernel launch overhead, and improves GPU utilization. For example, the three 1x1 CBR operators in Fig. 4 accept the same input and perform the same kind of computation, so they can be merged into one larger 1x1 CBR operator, as shown in Fig. 5.
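A minimal sketch of such a merge, with assumed channel counts: the filter banks of parallel 1x1 convolutions that share one input are concatenated along the output-channel axis, so a single larger convolution (one kernel launch) replaces three small ones, and each consumer later slices out its own channels.

    import numpy as np

    c_in = 32
    w1 = np.random.rand(16, c_in, 1, 1)               # three parallel 1x1 conv filter banks
    w2 = np.random.rand(24, c_in, 1, 1)
    w3 = np.random.rand(8,  c_in, 1, 1)
    w_merged = np.concatenate([w1, w2, w3], axis=0)   # one conv with 16+24+8 output channels
    assert w_merged.shape == (48, c_in, 1, 1)
    # after the merged conv produces y, consumers slice their channels back out:
    # y1 = y[:, 0:16], y2 = y[:, 16:40], y3 = y[:, 40:48]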
3) The optimization-space similarity of the neural network operators is calculated.
3.1) The method implements a low-level intermediate language (LIR), a finer-grained intermediate form on which operator-level optimization and target-platform code generation can be performed.
3.2) The graph-optimized computational graph is mapped onto the LIR, represented in LIR, and an optimization space is defined for each operator. As shown in Fig. 6, the Conv operator in NCHW layout on the left of the figure has the optimization space shown on the right. The Conv operator admits tiling in 6 dimensions and two loop-unrolling optimizations. Take the first optimization operation, "tile_f", as an example: it tiles the operator's "f" dimension, of original length 32, into 4 dimensions, giving 56 tiling schemes in total. By analogy, the number of candidate schemes for each optimization operation is the value in the rightmost column, and the optimization space of the whole Conv operator is the product of all these scheme counts, about 130 million.
The tiling operation divides one dimension of the original space into m dimensions, for example tiling the dimension x of the original space into [x1, x2, x3, x4]. In the example above, the dimension f of length 32 is selected and tiled into 4 dimensions. Once it is fixed that the dimension is tiled into 4, the length 32 must be split into 4 nested loops; the possible splits, up to ordering, are (1, 1, 1, 32), (1, 1, 2, 16), (1, 1, 4, 8), (1, 2, 2, 8), (1, 2, 4, 4), and (2, 2, 2, 4), which contribute 4, 12, 12, 12, 12, and 4 ordered tiling schemes respectively, 56 in total.
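The count of 56 can be checked directly: a tiling scheme is an ordered choice of 4 factors whose product is 32, and the short Python sketch below enumerates them.

    from itertools import product
    from math import prod

    def tilings(length, parts):
        # all ordered tuples of `parts` divisors whose product is `length`
        divisors = [d for d in range(1, length + 1) if length % d == 0]
        return [t for t in product(divisors, repeat=parts) if prod(t) == length]

    print(len(tilings(32, 4)))   # prints 56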
3.3) The operator optimization-space similarity is computed. First, a hash method is defined for each optimization operation according to the operator's optimization-space attributes, abstracting the optimization operation into hash values. Each pair of operators is then vectorized: the hash values are appended in sequence as vector values following the order of the optimization operations, and the two operators' vectors are made equal in length by zero padding. Fig. 7 gives an example of computing the optimization-space vector values for a pair of operators. For two spaces, space 1 and space 2, with the same sequence of optimization operations, the corresponding optimization operations tile_rc of space 1 and tile_rc of space 2 are selected in turn, and their hash values are used for vectorization. After vectorization, the vector of space 2's tile_rc is shorter than that of space 1's tile_rc, so zeros are appended to the end of space 2's vector until the lengths match; the resulting vector values are vec1 and vec2. By analogy, the vector values of every optimization operation in the space are concatenated in turn to form the vector value of the whole space. Finally, the cosine of the pair of vectors is taken as the similarity of the pair of operators. Measures such as Jaccard similarity were also tried for the search-space similarity, but their classification results were inferior to the cosine value, so the cosine value was finally adopted as the similarity measure.
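The following Python sketch mirrors this procedure under stated assumptions: op_hash, the example operation sequences, and the scaling of hash values are illustrative stand-ins, not the patent's concrete hash method. Python's built-in hash is only consistent within one process, which suffices for comparing operators inside a single compilation run.

    import numpy as np

    def op_hash(option):
        # abstract one optimization option into a small numeric value (assumed hash)
        return hash(option) % 1000 / 1000.0

    def space_vector(op_sequences, max_lens):
        # hash each operation's options, zero-pad to the pairwise maximum length,
        # then concatenate all operations into one vector for the whole space
        vec = []
        for options, max_len in zip(op_sequences, max_lens):
            h = [op_hash(o) for o in options]
            vec.extend(h + [0.0] * (max_len - len(h)))
        return np.array(vec)

    def cosine(v1, v2):
        return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    space1 = [["tile_f:1x2x16", "tile_f:1x4x8"], ["unroll:2", "unroll:4"]]
    space2 = [["tile_f:1x2x16"], ["unroll:2", "unroll:4"]]
    lens = [max(len(a), len(b)) for a, b in zip(space1, space2)]
    print(cosine(space_vector(space1, lens), space_vector(space2, lens)))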
4) Operator-level optimization is performed on the neural network, and target-platform code is generated. The execution flow is shown in Fig. 8.
4.1) The similarity matrix of the operators is calculated.
4.2) AP clustering (Affinity Propagation clustering) is executed with the similarity matrix as input. The AP clustering algorithm does not require the number of clusters to be fixed in advance, and the center of each resulting cluster is one of the input operators, so no core operator needs to be computed separately for each cluster.
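For illustration, scikit-learn's Affinity Propagation accepts exactly such a precomputed similarity matrix; the 4x4 matrix below is a made-up example standing in for the operator cosine similarities, and the exemplars it returns play the role of the core operators.

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    S = np.array([[1.00, 0.95, 0.10, 0.15],    # assumed operator similarity matrix
                  [0.95, 1.00, 0.12, 0.11],
                  [0.10, 0.12, 1.00, 0.90],
                  [0.15, 0.11, 0.90, 1.00]])
    ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
    print(ap.labels_)                   # cluster id of each operator
    print(ap.cluster_centers_indices_)  # the exemplar (core) operators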
4.3) For the core operator of each cluster, the operator's complete optimization space is explored, and the n optimization schemes with the shortest inference time found during the exploration are saved. For the Yolo v3-tiny model, for example, the clustering algorithm divides the original 20 operators into 8 classes; the complete optimization-space exploration then only needs to be run for the 8 cluster core operators, saving for each one the 10 schemes with the shortest inference time found during the exploration.
4.4) Each non-core operator of a cluster only needs to explore the n best schemes of the core operator it depends on. For the Yolo v3-tiny model, for example, each of the 12 operators that are not cluster centers only needs to search among the 10 best optimization schemes of the core operator of its class.
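A compact sketch of steps 4.3 and 4.4, under assumed helper names: full_space(op) enumerates an operator's optimization schemes and measure(op, scheme) returns its measured inference time on the hardware; neither helper is part of the patent text.

    import heapq

    def explore_core(op, n=10):
        # full-space search for a cluster core: keep the n fastest schemes
        return heapq.nsmallest(n, full_space(op), key=lambda s: measure(op, s))

    def explore_member(op, core_top_n):
        # a non-core operator only re-measures its core's n candidate schemes
        return min(core_top_n, key=lambda s: measure(op, s))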
4.5) Target-platform code is generated for each operator according to its optimization scheme, and the operator code is deployed to the hardware to run the neural network in the order given by the computational graph. For code to be deployed on a CPU, the third-party tool LLVM is invoked to generate the corresponding CPU code; for an Nvidia GPU, the corresponding CUDA code is generated and then deployed to the GPU to run.
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or substitutions do not remove the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the invention.

Claims (10)

1. An acceleration method for exploring the optimization space in a deep learning compiler, comprising the following steps:
S1, abstracting the neural network and representing it in the form of a computational graph;
S2, performing graph-level optimization on the computational graph;
S3, defining an optimization space for each operator in the optimized computational graph, and computing optimization-space similarity based on the operators' optimization-space information;
S4, exploring the operator state space based on similarity: clustering the operators by similarity, exploring the full space of the core operator in each cluster, exploring the remaining operators of each cluster within the best schemes of its core operator, determining the best scheme for every operator of the whole neural network, and generating target-platform code for the hardware platform, comprising the following steps:
S41, calculating the similarity matrix of the operators;
S42, executing AP clustering with the similarity matrix as input, wherein the center of each resulting cluster is one of the input operators;
S43, for the core operator of each cluster, exploring the operator's complete optimization space and saving the n optimization schemes with the shortest inference time found during the exploration;
S44, for each non-core operator of a cluster, exploring only the n best schemes saved for the core operator it depends on;
and S45, generating target-platform code for each operator according to its optimization scheme, and deploying the operator code to hardware to run the neural network in the order given by the computational graph.
2. The acceleration method for exploring the optimization space in a deep learning compiler according to claim 1, wherein the neural network computational graph representation of step S1 comprises the following steps:
S11, mapping the neural network constructed in a deep learning framework onto a well-defined high-level intermediate language (HIR);
and S12, analyzing the attributes of each operator based on the high-level intermediate language and constructing a computational graph from the data dependencies among the operators, wherein the constructed graph is a directed acyclic graph in which each node represents one operator of the neural network and each edge represents a data dependency between operators.
3. The acceleration method for exploring the optimization space in a deep learning compiler according to claim 1, wherein the graph optimization of step S2 comprises the following steps:
S21, performing operator fusion according to the computation type of each operator, wherein operator fusion combines several basic operators into one composite operator without storing intermediate results;
S22, performing data layout optimization on the computational graph after operator fusion according to the hardware characteristics;
and S23, merging parallel operators in the computational graph after data layout optimization.
4. The acceleration method for exploring the optimization space in a deep learning compiler according to claim 3, wherein step S21 specifically comprises: first constructing a dominator tree, then traversing the nodes of the dominator tree, and, if the nodes on the path from a node to its dominator satisfy a predefined fusion rule, fusing those operators into a new composite operator.
5. The acceleration method for exploring the optimization space in a deep learning compiler according to claim 3, wherein step S22 specifically comprises: for the computational graph after operator fusion, judging whether a data layout scheme was specified at graph input; if so, applying the specified scheme directly; if not, selecting the best data layout scheme according to the hardware characteristics, the layout schemes including row-major and column-major storage.
6. The acceleration method for exploring the optimization space in a deep learning compiler according to claim 3, wherein step S23 specifically comprises: merging several operators that share the same input into one larger operator.
7. The method of claim 1, wherein in step S3 the graph-optimized computational graph is mapped onto a low-level intermediate language (LIR), represented in LIR, and an optimization space is defined for each operator.
8. The acceleration method for exploring the optimization space in a deep learning compiler according to claim 7, wherein the operator undergoes multi-dimensional tiling and several loop-unrolling optimizations: when a dimension of the operator of original length l is tiled into m dimensions, there are k tiling schemes in total; by analogy, each optimization operation has some number k of candidate schemes, and the optimization space of the whole operator is the product of the scheme counts of all its optimization operations.
9. The acceleration method for exploring the optimization space in a deep learning compiler according to claim 7, wherein in step S3 the optimization-space similarity is calculated by the following steps:
S31, defining a hash method for each optimization operation according to the operator's optimization-space attributes, abstracting the optimization operation into hash values;
S32, vectorizing each pair of operators: appending the hash values in sequence as vector values following the order of the optimization operations, making the two operators' vectors equal in length by zero padding, and concatenating the vector values of every optimization operation in the space in turn to form the vector value of the whole space;
and S33, computing the similarity of the pair of operators from the concatenated vector values.
10. The method of claim 9, wherein the similarity calculation takes the cosine of the pair of concatenated vectors as the similarity of the pair of operators.
CN202110223874.3A 2021-03-01 2021-03-01 Acceleration method for exploring optimization space in deep learning compiler Active CN112579063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110223874.3A CN112579063B (en) 2021-03-01 2021-03-01 Acceleration method for exploring optimization space in deep learning compiler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110223874.3A CN112579063B (en) 2021-03-01 2021-03-01 Acceleration method for exploring optimization space in deep learning compiler

Publications (2)

Publication Number Publication Date
CN112579063A (en) 2021-03-30
CN112579063B (en) 2021-06-08

Family

ID=75114093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110223874.3A Active CN112579063B (en) 2021-03-01 2021-03-01 Acceleration method for exploring optimization space in deep learning compiler

Country Status (1)

Country Link
CN (1) CN112579063B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113031966A (en) * 2021-05-20 2021-06-25 之江实验室 Deep learning compilation optimization method for intelligently selecting compilation acceleration library
CN113656563A (en) * 2021-07-15 2021-11-16 华为技术有限公司 Neural network searching method and related equipment
CN113703741A (en) * 2021-10-29 2021-11-26 深圳思谋信息科技有限公司 Neural network compiler configuration method and device, computer equipment and storage medium
CN113961267A (en) * 2021-10-15 2022-01-21 杭州海康威视数字技术股份有限公司 Service processing method, device and equipment
CN114115834A (en) * 2022-01-25 2022-03-01 之江实验室 Software and hardware co-compiling processing method and system
CN114428616A (en) * 2022-04-01 2022-05-03 北京清微智能信息技术有限公司 Method for optimizing replacement cost in neural network compiling stage
WO2023284770A1 (en) * 2021-07-13 2023-01-19 清华大学 Tensor program optimization method and apparatus
CN115659281A (en) * 2022-11-16 2023-01-31 之江实验室 Method and device for fusing self-adaptive acceleration operators
CN116301904A (en) * 2023-05-18 2023-06-23 之江实验室 Operator optimization acceleration method and device for deep learning compiler
WO2024021192A1 (en) * 2022-07-25 2024-02-01 之江实验室 Graph optimization method and apparatus for neural network calculation
WO2024051377A1 (en) * 2022-09-07 2024-03-14 华为云计算技术有限公司 Model optimization method and apparatus and computing device
WO2024082551A1 (en) * 2022-10-17 2024-04-25 上海壁仞科技股份有限公司 Operator fusion method, computing apparatus, computing device and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170293474A1 (en) * 2015-03-26 2017-10-12 IfWizard Corporation Automatically optimizing analytics database server
CN110490309A (en) * 2019-08-14 2019-11-22 北京中科寒武纪科技有限公司 A kind of Operator Fusion method and its Related product for neural network
CN110766147A (en) * 2018-07-25 2020-02-07 赛灵思公司 Neural network compiler architecture and compiling method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170293474A1 (en) * 2015-03-26 2017-10-12 IfWizard Corporation Automatically optimizing analytics database server
CN110766147A (en) * 2018-07-25 2020-02-07 赛灵思公司 Neural network compiler architecture and compiling method
CN110490309A (en) * 2019-08-14 2019-11-22 北京中科寒武纪科技有限公司 A kind of Operator Fusion method and its Related product for neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴林阳 (Wu Linyang): "一种运算和数据协同优化的深度学习编译框架" [A deep learning compilation framework with coordinated optimization of computation and data], 《高技术通讯》 (High Technology Letters) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113031966A (en) * 2021-05-20 2021-06-25 之江实验室 Deep learning compilation optimization method for intelligently selecting compilation acceleration library
WO2023284770A1 (en) * 2021-07-13 2023-01-19 清华大学 Tensor program optimization method and apparatus
CN113656563A (en) * 2021-07-15 2021-11-16 华为技术有限公司 Neural network searching method and related equipment
CN113961267B (en) * 2021-10-15 2023-08-25 杭州海康威视数字技术股份有限公司 Service processing method, device and equipment
CN113961267A (en) * 2021-10-15 2022-01-21 杭州海康威视数字技术股份有限公司 Service processing method, device and equipment
CN113703741A (en) * 2021-10-29 2021-11-26 深圳思谋信息科技有限公司 Neural network compiler configuration method and device, computer equipment and storage medium
CN113703741B (en) * 2021-10-29 2022-02-22 深圳思谋信息科技有限公司 Neural network compiler configuration method and device, computer equipment and storage medium
CN114115834A (en) * 2022-01-25 2022-03-01 之江实验室 Software and hardware co-compiling processing method and system
CN114115834B (en) * 2022-01-25 2022-04-26 之江实验室 Software and hardware co-compiling processing method and system
US11977865B2 (en) 2022-01-25 2024-05-07 Zhejiang Lab Software and hardware collaborative compilation processing system and method
CN114428616A (en) * 2022-04-01 2022-05-03 北京清微智能信息技术有限公司 Method for optimizing replacement cost in neural network compiling stage
WO2024021192A1 (en) * 2022-07-25 2024-02-01 之江实验室 Graph optimization method and apparatus for neural network calculation
WO2024051377A1 (en) * 2022-09-07 2024-03-14 华为云计算技术有限公司 Model optimization method and apparatus and computing device
WO2024082551A1 (en) * 2022-10-17 2024-04-25 上海壁仞科技股份有限公司 Operator fusion method, computing apparatus, computing device and readable storage medium
CN115659281B (en) * 2022-11-16 2023-10-27 之江实验室 Method and device for fusing adaptive acceleration operators
CN115659281A (en) * 2022-11-16 2023-01-31 之江实验室 Method and device for fusing self-adaptive acceleration operators
CN116301904B (en) * 2023-05-18 2023-08-22 之江实验室 Operator optimization acceleration method and device for deep learning compiler
CN116301904A (en) * 2023-05-18 2023-06-23 之江实验室 Operator optimization acceleration method and device for deep learning compiler

Also Published As

Publication number Publication date
CN112579063B (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN112579063B (en) Acceleration method for exploring optimization space in deep learning compiler
US11803404B2 (en) Deep learning algorithm compiling method, device, and related product
CN112465108B (en) Neural network compiling method for storage and calculation integrated platform
CN110764744B (en) Intermediate representation generation method and device for neural network calculation
US10901715B1 (en) Lazy compilation and kernel fusion in dynamic computation graphs
Bik et al. Compiler support for sparse tensor computations in MLIR
JP3299611B2 (en) Resource allocation device
CN113031966B (en) Deep learning compilation optimization method for intelligently selecting compilation acceleration library
Le et al. Tflms: Large model support in tensorflow by graph rewriting
US20210049231A1 (en) Multiple Output Fusion For Operations Performed In A Multi-Dimensional Array of Processing Units
WO2021000971A1 (en) Method and device for generating operation data and related product
CN111104120A (en) Neural network compiling method and system and corresponding heterogeneous computing platform
WO2022087788A1 (en) Neural network compiling optimization method and related apparatus
CN116401502B (en) Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN114691148A (en) Model reasoning acceleration method and device, electronic equipment and storage medium
CN116868202A (en) Data processing method, device, equipment and medium
CN115423082A (en) Automatic optimization method for depth model calculation graph related to hardware characteristics
Makrynioti et al. Declarative data analytics: A survey
CN115809063A (en) Storage process compiling method, system, electronic equipment and storage medium
Liu et al. swTVM: exploring the automated compilation for deep learning on sunway architecture
Wen et al. A swap dominated tensor re-generation strategy for training deep learning models
WO2023160290A1 (en) Neural network inference acceleration method, target detection method, device, and storage medium
CN111667060B (en) Deep learning algorithm compiling method and device and related products
CN112800425B (en) Code analysis method and device based on graph calculation
CN115860061A (en) Graph neural network optimization method and graph neural network inference system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant