CN115423082A - Automatic optimization method for depth model calculation graph related to hardware characteristics

Automatic optimization method for depth model calculation graph related to hardware characteristics

Info

Publication number
CN115423082A
Authority
CN
China
Prior art keywords
operator
fusion
operators
fused
subgraph
Prior art date
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN202211175991.8A
Other languages
Chinese (zh)
Inventor
Lin Hanqun (林涵群)
Dong Dong (东东)
Jiang Hongxu (姜宏旭)
Li Bo (李波)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202211175991.8A
Publication of CN115423082A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation using electronic means
    • G06N 3/08 - Learning methods


Abstract

The invention discloses an automatic optimization method for a depth model calculation graph related to hardware characteristics, comprising the following steps: converting a deep learning model into an intermediate representation according to preset rules; performing a first operator fusion on the intermediate representation according to static operator fusion rules to form composite subgraphs; parallelizing the operators in the composite subgraphs according to the hardware resources and performing a second operator fusion to form aggregated subgraphs; performing cross-boundary optimization on the aggregated subgraphs; building a list of original operators and fused operators from the records of the first and second operator fusions; constructing a cost model on the target hardware platform based on that list and evaluating the fusion effect; and splitting fusion operators according to the evaluation of the fusion effect, with the resulting subgraphs serving as the input for operator-level code generation. With this method, graph optimization in mainstream deep learning compilers requires less manual effort and achieves higher hardware resource utilization.

Description

Automatic optimization method for depth model calculation graph related to hardware characteristics
Technical Field
The invention belongs to the technical field of deep learning model compilation acceleration in computer science, and in particular relates to an automatic optimization method for a depth model calculation graph related to hardware characteristics.
Background
In recent years, deep learning has developed rapidly, and the industry has produced many deep learning development frameworks. Because deep learning is widely applied and highly compute-intensive, its algorithms generally need to run on a variety of general-purpose and special-purpose hardware, such as different kinds of CPUs, GPUs, TPUs, and NPUs. This creates a combinatorial explosion between frameworks and hardware. Meanwhile, new network architectures keep emerging, such as YOLO, BERT, and GPT. These networks are composed of operators of different types, shapes, and connection patterns, and they ultimately run on hardware of different kinds and models. The result is a high cost of manually developing and implementing optimized operators for each scenario.
A computation graph is a directed graph consisting of: a set of nodes, each representing an operation; and a set of directed edges, each marking a relationship (data transfer or control dependency) between nodes. Optimizing the computation graph can speed up neural network training and inference. At present, mainstream frameworks such as TensorFlow, PyTorch, and TVM employ a number of computation-graph optimizations to accelerate computation: TensorFlow provides a graph-optimizer API that users can call directly, and TVM performs optimization through techniques such as operator fusion. However, mainstream graph optimization in current deep learning compilers usually requires substantial manual effort, and the optimization strategies are independent of the hardware, so hardware resources cannot be exploited to the full.
Therefore, how to reduce labor cost and improve hardware resource utilization when optimizing graphs in mainstream deep learning compilers has become a key research problem.
Disclosure of Invention
In view of the above problems, the present invention provides an automatic optimization method for a depth model calculation graph related to hardware characteristics, which solves at least some of the above technical problems: with this method, graph optimization in mainstream deep learning compilers requires less labor and achieves higher hardware resource utilization.
An embodiment of the invention provides an automatic optimization method for a depth model calculation graph related to hardware characteristics, comprising the following steps:
S1, obtaining a deep learning model and converting it into an intermediate representation according to preset rules; the intermediate representation describes the structure of the computation graph;
S2, performing a first operator fusion on the intermediate representation according to static operator fusion rules to form composite subgraphs;
S3, parallelizing the operators in the composite subgraphs according to the hardware resources, and performing a second operator fusion to form aggregated subgraphs;
S4, performing cross-boundary optimization on the aggregated subgraphs;
S5, building a list of original operators and fused operators from the records of the first and second operator fusions; constructing a cost model on the target hardware platform based on that list, and evaluating the fusion effect;
S6, splitting fusion operators according to the evaluation of the fusion effect, and using the resulting subgraphs as the input for operator-level code generation.
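Taken together, S1 to S6 form a linear pipeline. The following is a minimal sketch of that pipeline as a single driver function; all names are illustrative, the stage implementations are passed in as callables, and nothing here is the patent's actual code.

```python
# Minimal sketch of the S1-S6 pipeline. The stage functions are injected as
# callables so the skeleton stays self-contained; every name is illustrative.
def optimize_graph(model, hardware,
                   convert_to_ir, static_fuse, parallelize_and_fuse,
                   cross_boundary_optimize, build_cost_model, split_unprofitable):
    ir = convert_to_ir(model)                               # S1: model -> IR
    composite = static_fuse(ir)                             # S2: first (rule-based) fusion
    aggregated = parallelize_and_fuse(composite, hardware)  # S3: parallelize + second fusion
    aggregated = cross_boundary_optimize(aggregated)        # S4: folding, inlining, CSE
    cost_model = build_cost_model(aggregated, hardware)     # S5: per-hardware cost model
    return split_unprofitable(aggregated, cost_model)       # S6: re-split, hand to codegen
```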
Further, the intermediate representation defines the computation type, the number of inputs, and the number of outputs of each operator in the computation graph, together with the parameters relevant to the computation; for fusion operators, the intermediate representation also records the number of component operators and their associated information.
Further, S2 specifically includes: according to the converted intermediate representation, traversing the computation graph in reverse order from the outputs and fusing the operators that satisfy the static operator fusion rules; iterating this process until no more operators in the graph can be fused; and re-integrating the fused computation graph into composite subgraphs.
Further, the static operator fusion rules include:
data dependence exists between the operators to be fused;
fusion must not create directed cycles between subgraphs;
the on-chip cache is sufficient after fusion, so that no data needs to be moved from main memory during the computation.
Further, in S3 the parallelization includes: abstracting the underlying hardware resources into virtual computing units and scheduling the pending computations onto those units, so that the operators are processed in parallel.
Further, the parallelization comprises intra-operator parallelism and inter-operator parallelism;
the intra-operator parallelism includes: organizing the parallel logic among the operators being fused, and, where hardware resources permit, running in parallel those computations inside the fusion operator that have no data dependence;
the inter-operator parallelism includes: after the operators have been fused, running the fusion operators themselves in parallel; when fusion operators run in parallel, the intra-operator parallelism is readjusted at the same time.
Further, if spare resources can realize inter-operator parallelism at the price of a lower degree of intra-operator parallelism, the inter-operator parallel strategy is given priority.
Further, S4 specifically includes:
traversing the computation graph by topological sorting: first selecting a node with in-degree 0 and executing it, then deleting that node and the edges connected to it, and then finding the next node with in-degree 0 to execute;
performing cross-boundary optimization during the traversal; the cross-boundary optimization includes constant folding, inline optimization, and common sub-expression extraction.
Further, in S5 the cost model is trained as follows:
traversing the computation graph and extracting the features of all operators present at that point, including the number and types of the operators composing each fusion operator;
selecting a preset number of operators, deploying them on the hardware, and measuring their time and energy consumption as labeled data;
building a model from the labeled data according to a semi-supervised algorithm and predicting the unlabeled data;
adding high-confidence predictions to the labeled data set and retraining the model with them as labels;
when the labeled data set exceeds a preset proportion of the data, ending training to obtain the trained cost model.
Further, in S6, if the fusion effect does not meet a preset condition, the fusion operator is split.
Compared with the prior art, the disclosed automatic optimization method for a depth model calculation graph related to hardware characteristics has the following beneficial effects:
the invention uses the intermediate representation to be compatible with the existing deep learning compiler better, so that the method can be widely applied to the field of deep learning compilation acceleration. The composite operator in the original model is expanded into the fine-grained operator, so that the interior of the original operator can be optimized in a finer granularity mode, the operator types are unified, the number of operators provided by the framework is reduced, and the subsequent fusion and cost model construction are facilitated.
Through operator fusion and parallelization, the invention accelerates inference from the perspective of reducing main-memory reads and maximizing the use of hardware resources. Arranging operator fusion before parallelization reduces the volume of data the parallel stage must handle, and inter-operator and intra-operator parallelism are handled cooperatively.
The method uses the cost model to evaluate the energy and time consumption of operators and splits fusion operators accordingly, so the graph optimization scheme adapts dynamically to different hardware back ends. After fusion and parallelization, the cost model's evaluation of operator performance is closer to how the operator actually runs on the hardware, so its predictions are more accurate.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a general flowchart of a method for automatically optimizing a depth model calculation graph related to hardware characteristics according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a semantic tree according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a fusion rule of static operators according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of operator parallelism according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a cost model construction process provided by the embodiment of the present invention.
Fig. 6 is a schematic diagram of an operator segmentation process according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of a CNN provided in an embodiment of the present invention.
Fig. 8 is a schematic diagram of a neural network structure according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of operator fusion performed on a complex model according to an embodiment of the present invention.
Fig. 10 is a schematic diagram of performing parallel processing on a computation graph according to an embodiment of the present invention.
Fig. 11 is a schematic diagram of performing operator segmentation on a CNN structure according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Referring to fig. 1, an embodiment of the present invention provides an automatic optimization method for a depth model calculation graph related to hardware characteristics, which specifically includes the following steps:
S1, obtaining a deep learning model and converting it into an intermediate representation according to preset rules; the intermediate representation describes the structure of the computation graph;
S2, performing a first operator fusion on the intermediate representation according to static operator fusion rules to form composite subgraphs;
S3, parallelizing the operators in the composite subgraphs according to the hardware resources, and performing a second operator fusion to form aggregated subgraphs;
S4, performing cross-boundary optimization on the aggregated subgraphs;
S5, building a list of original operators and fused operators from the records of the first and second operator fusions, constructing a cost model on the target hardware platform, and evaluating the fusion effect;
S6, splitting fusion operators according to the evaluation of the fusion effect, and using the resulting subgraphs as the input for operator-level code generation.
Each of these steps is described in detail below.
(1) Intermediate Representation
Mainstream model formats such as ONNX, PyTorch, and TensorFlow are received as input and converted into a custom intermediate representation that describes the structure of the computation graph; for example, into TVM's Relay IR for further processing. Relay is a full-featured programming language used as the intermediate representation of machine learning systems; Relay IR is a purely expression-oriented language.
the intermediate representation defines the type of computation, the number of inputs and the number of outputs for each operator in the computation graph, and the necessary parameters associated with the computation. For fusion operators, the intermediate representation is also used to record that it consists of several operators and to record the information about these operators. The resulting semantic tree is shown in fig. 2.
(2) Static operator fusion
The overall process is as follows: according to the converted IR, the computation graph is traversed in reverse order from the outputs, and operators satisfying the static operator fusion rules are fused; the process is iterated until no more operators in the graph can be fused; the fused computation graph is then re-integrated into composite subgraphs.
The static operator fusion rules are shown in fig. 3. The purpose of operator fusion is to reduce the number of data transfers from main memory. To ensure that no potentially profitable fusion is missed when the cost model is evaluated later, a greedy strategy is adopted: all operators with data dependence are fused as far as possible. However, the computation of a fused operator must not exceed what the hardware can bear, so some fusions that are generally considered to bring little performance gain are also given up. The basic fusion criteria are: 1) data dependence exists between the operators to be fused; 2) fusion must not create directed cycles between subgraphs; 3) the hardware resources after fusion allow the computation to complete without moving data from main memory, i.e., the on-chip cache is sufficient after fusion and no data needs to be transferred from main memory during the computation.
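A minimal sketch of this greedy pass, using networkx as a stand-in graph library, is given below. The cycle test and the cache check are simplified: each node is assumed to carry a "bytes" attribute estimating its working set, the cache budget is an assumed constant, and attribute merging after contraction is elided.

```python
import networkx as nx

ON_CHIP_CACHE_BYTES = 512 * 1024           # assumed on-chip cache budget for the target

def can_fuse(g, a, b):
    if not g.has_edge(a, b):               # rule 1: producer/consumer data dependence
        return False
    merged = nx.contracted_nodes(g, a, b, self_loops=False)
    if not nx.is_directed_acyclic_graph(merged):
        return False                       # rule 2: fusion must not create a directed cycle
    footprint = g.nodes[a]["bytes"] + g.nodes[b]["bytes"]
    return footprint <= ON_CHIP_CACHE_BYTES  # rule 3: working set fits in on-chip cache

def static_fuse(g):
    changed = True
    while changed:                         # iterate to a fixpoint: no fusible pair remains
        changed = False
        for v in reversed(list(nx.topological_sort(g))):   # reverse order from the outputs
            for u in list(g.predecessors(v)):
                if can_fuse(g, u, v):
                    g = nx.contracted_nodes(g, u, v, self_loops=False)
                    changed = True
                    break
            if changed:
                break
    return g
```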
Operators commonly used in deep learning fall into the following six classes:
1) Algebraic operators: element-wise computations between two tensors, including RELU, ADD, MUL, etc. 2) Broadcast operators: operators that handle tensors of different shapes during computation through copy operations; if an algebraic operator needs to adjust tensor shapes during computation, it is also counted among the broadcast operators, which include BN and the like. 3) Reduce operators: operations that reduce the number of elements in a tensor along a specified axis, including SUM, ARGMAX, etc. 4) Complex operators: the relatively general, computation-heavy operators in neural networks, including CONV5x5, FC, etc. 5) Non-fusible operators: operators that bring no profit when fused with others, including constants and the like. 6) Bi-phase operators: operators that are expressed differently after fusion on different hardware, for which it cannot be determined in advance whether fusion is profitable, including MAXPOOL, CONV1x1, etc. On the basis of the fusion criteria, the following static fusion rules were obtained with the support of a large number of historical experiments. (Operators of the same kind may be classified into different types depending on their specific parameters; for example, CONV1x1 is a bi-phase operator while CONV5x5 is a complex operator. The specific classification depends on the actual situation.)
A static operator fusion rule means that the fusion improves performance on most hardware platforms; for example, fusing CONV + RELU + BN conforms to a stable fusion rule. A forbidden fusion rule means that in most cases the fusion degrades performance or has no effect on overall performance. Apart from the fusion types forbidden by these rules, operators may be fused as long as the basic fusion criteria are met. If an operator combination matches a stable fusion rule, the fused operator will not be split later; otherwise, whether the fusion is effective is evaluated by the cost model during the operator-splitting stage, and if performance degrades after fusion, the fusion operator is taken apart again. Because the performance of different operators on different hardware fluctuates, the operator fusion done at this stage cannot guarantee a positive impact on inference; hardware-aware operator splitting is therefore performed later, which improves adaptability to the back end.
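One way to encode such a rule table is sketched below. The six operator classes follow the list above; the concrete stable and forbidden pairings are illustrative examples, not the patent's full experimentally derived table.

```python
OP_CLASS = {                                   # illustrative class assignments
    "RELU": "algebraic", "ADD": "algebraic", "MUL": "algebraic",
    "BN": "broadcast",
    "SUM": "reduce", "ARGMAX": "reduce",
    "CONV5x5": "complex", "FC": "complex",
    "CONST": "non_fusible",
    "MAXPOOL": "biphase", "CONV1x1": "biphase",
}

STABLE = {("complex", "algebraic"),            # e.g. CONV + RELU
          ("algebraic", "broadcast")}          # e.g. RELU + BN
FORBIDDEN = {("non_fusible", c) for c in
             ("algebraic", "broadcast", "reduce", "complex", "biphase")}

def fusion_verdict(producer: str, consumer: str) -> str:
    pair = (OP_CLASS[producer], OP_CLASS[consumer])
    if pair in STABLE:
        return "stable"      # fuse now; exempt from later re-splitting
    if pair in FORBIDDEN or pair[::-1] in FORBIDDEN:
        return "forbidden"   # never fuse
    return "candidate"       # fuse greedily; the cost model re-evaluates in S6
```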
(3) Parallelization processing
The underlying hardware resources are abstracted into virtual computing units, and the pending computations are scheduled onto these units so that the operators are processed in parallel. Parallelism inside each fusion operator and parallelism between the fused operators are both exploited, so that hardware resources are fully utilized. The parallel structure is shown in fig. 4.
The parallelization comprises intra-operator parallelism and inter-operator parallelism, where:
intra-operator parallelism: the parallel logic among the operators being fused is organized at fusion time, and computations inside the fusion operator that have no data dependence are run in parallel where hardware resources permit; through loop-optimization techniques, parallelism is not confined to the boundaries between the fused operators, and the computation inside each operator is accelerated as well;
inter-operator parallelism: after the operators have been fused, the fusion operators themselves continue to run in parallel; when fusion operators run in parallel, the intra-operator parallelism is readjusted at the same time; and if spare resources can realize inter-operator parallelism at the price of a lower degree of intra-operator parallelism, the inter-operator parallel strategy is given priority.
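The following sketch shows one way to schedule fused operators onto the abstract computing units in dependence-ordered waves: operators with no remaining data dependences run concurrently (inter-operator parallelism), and units left idle in a wave are assumed to be spent on intra-operator parallelism. The wave policy and the unit model are assumptions, not the patent's scheduler.

```python
import networkx as nx

def schedule_waves(g: nx.DiGraph, num_units: int):
    indeg = dict(g.in_degree())
    ready = [n for n, d in indeg.items() if d == 0]
    waves = []
    while ready:
        wave, ready = ready[:num_units], ready[num_units:]  # inter-operator parallelism first
        spare = num_units - len(wave)          # leftover units -> intra-operator parallelism
        waves.append((wave, spare))
        for n in wave:                         # retire the wave, release its consumers
            for m in g.successors(n):
                indeg[m] -= 1
                if indeg[m] == 0:
                    ready.append(m)
    return waves   # each wave: (operators run concurrently, spare units for intra-op use)
```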
(4) Cross-boundary optimization
Cross-boundary optimizations such as constant propagation and common sub-expression extraction are performed on the aggregated subgraphs. Specifically, the computation graph is traversed by topological sorting: a node with in-degree 0 is selected and executed, the node and the edges connected to it are deleted, and the next node with in-degree 0 is found and executed. Cross-boundary optimization is performed during this traversal and includes constant folding, inline optimization, and common sub-expression extraction, where:
constant folding: if all the inputs on which a computation depends are more constant, the result is directly computed, replacing the nodes in the graph.
And (3) inline optimization: and expanding the function nodes with small code amount and simple calculation, and deleting useless nodes.
Common sub-expression elimination: in a program, several expressions are called common sub-expressions if their types, parameters, and inputs are the same. For the common sub-expressions, only the value of one expression needs to be calculated, and the values of other expressions can be obtained through evaluation. The calculation processed is recorded in the traversing process, and if the calculation to be processed is the same as the calculation to be processed, the calculation is directly replaced.
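A minimal sketch of this traversal, with constant folding and common sub-expression elimination folded into one pass over a networkx graph, is shown below. The node encoding (an "op" string, optional "attrs", a scalar "value" for constants, an "eval" callable, and sortable node ids) is an assumption made for the sketch; inline optimization is omitted for brevity.

```python
import networkx as nx

def cross_boundary_optimize(g: nx.DiGraph):
    seen = {}                                    # CSE table: expression key -> surviving node
    for n in list(nx.topological_sort(g)):       # Kahn-style order: in-degree-0 nodes first
        node = g.nodes[n]
        preds = list(g.predecessors(n))
        # Constant folding: all inputs constant -> evaluate now, node becomes a constant.
        if preds and all(g.nodes[p]["op"] == "const" for p in preds):
            node["value"] = node["eval"](*(g.nodes[p]["value"] for p in preds))
            node["op"] = "const"
            for p in preds:
                g.remove_edge(p, n)
        # Common sub-expression elimination: same op, attrs, value, and inputs.
        key = (node["op"], tuple(sorted(node.get("attrs", {}).items())),
               node.get("value"), tuple(sorted(g.predecessors(n))))
        if key in seen:
            for s in list(g.successors(n)):      # rewire consumers to the survivor
                g.add_edge(seen[key], s)
            g.remove_node(n)
        else:
            seen[key] = n
    return g
```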
(5) Cost model
The main function of the cost model in this embodiment is to estimate the running time and power consumption of each operator on the target hardware platform without actually deploying it, providing a reference for the dynamic fusion and splitting of the computation graph. To achieve automatic optimization that is aware of the target hardware platform, a new, best-suited cost model must be constructed for each different piece of hardware. A cost model built with deep learning methods needs a large amount of data, and gathering that data takes considerable time in practice, so the amount of data must be reduced to speed up the optimization process. A semi-supervised regression model is therefore used as the base model.
The process of building the cost model is shown in fig. 5. First, a pre-trained model is selected according to the hardware architecture, e.g., CPU, GPU, FPGA, or NPU. The computation graph is traversed and the features of all operators present at that point (fusion operators and the operators composing them) are extracted, including the input data size, the input data type, and the number and types of the operators composing each fusion operator. Then 10% of the operators are deployed on the hardware, and their actual time and energy consumption are measured as labeled data. Next, according to a semi-supervised algorithm, a model is built from the labeled data and used to predict the unlabeled data. Finally, high-confidence predictions are added to the labeled data set and the model is retrained with them as labels. When more than a preset proportion of the data is in the labeled set, training ends and the trained cost model is obtained.
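The self-training loop might look like the following sketch, with scikit-learn's gradient-boosting regressor standing in for the unspecified base model. The 10% seed fraction mirrors the text; the confidence heuristic (low prediction spread across a small ensemble), the stop fraction, and the feature encoding (one numeric vector per operator) are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def train_cost_model(features, measure_on_hw, seed_frac=0.10, stop_frac=0.80,
                     conf_threshold=0.05, seed=0):
    """features: list of numeric feature vectors, one per (fused) operator.
    measure_on_hw: callable that deploys one operator and returns its cost."""
    rng = np.random.default_rng(seed)
    n = len(features)
    labeled = [int(i) for i in rng.choice(n, size=max(1, int(seed_frac * n)),
                                          replace=False)]
    labels = {i: measure_on_hw(features[i]) for i in labeled}   # real measurements
    while len(labeled) < stop_frac * n:
        X = np.array([features[i] for i in labeled])
        y = np.array([labels[i] for i in labeled])
        ensemble = [GradientBoostingRegressor(random_state=s).fit(X, y)
                    for s in range(3)]                          # small ensemble
        unlabeled = [i for i in range(n) if i not in labels]
        if not unlabeled:
            break
        preds = np.array([[m.predict([features[i]])[0] for m in ensemble]
                          for i in unlabeled])
        spread = preds.std(axis=1) / (np.abs(preds.mean(axis=1)) + 1e-9)
        confident = [(i, p.mean()) for i, p, s in
                     zip(unlabeled, preds, spread) if s < conf_threshold]
        if not confident:
            break                                               # nothing trustworthy left
        for i, pseudo in confident:                             # pseudo-label and absorb
            labels[i] = float(pseudo)
            labeled.append(i)
    X = np.array([features[i] for i in labeled])
    y = np.array([labels[i] for i in labeled])
    return GradientBoostingRegressor(random_state=seed).fit(X, y)
```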
and evaluating the fusion effect through the trained cost model, and segmenting an operator if the evaluation result shows that the performance after fusion is poor. And finally, taking the generated subgraph as an input generation code of the operator layer.
(6) Dynamic subgraph re-splitting
The computation graph is traversed in reverse order; operators that do not match a stable fusion rule are re-split according to the records made at fusion time, and the operators before and after splitting are recorded. The recorded content is mainly the operator features required by the cost model. Splitting disturbs the internal parallel logic of some fusion operators, which must be rearranged according to the inter-operator parallel rules; the cost model must also take this into account during training. Cost models for time and for energy are used to evaluate the operators before and after splitting. If both time and energy consumption decrease after splitting, the operator is split; otherwise the decision follows the user's requirement: if the user demands minimum time, the split is made when time decreases; if the user demands minimum energy, the split is made when energy decreases. To prevent either from growing too large, an upper limit on energy consumption is set, and if it is exceeded, the operators to split are chosen again. The specific flow is shown in fig. 6.
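The per-operator splitting decision described above can be sketched as follows; the cost models are taken to be plain callables from operator features to predicted time or energy, which is an assumed interface.

```python
def decide_split(fused_feats, part_feats, time_model, energy_model,
                 priority="time", energy_cap=float("inf")):
    t_fused = time_model(fused_feats)
    e_fused = energy_model(fused_feats)
    t_split = sum(time_model(f) for f in part_feats)     # cost of the split parts
    e_split = sum(energy_model(f) for f in part_feats)

    if t_split < t_fused and e_split < e_fused:
        verdict = "split"                 # both metrics improve: always split
    elif priority == "time":
        verdict = "split" if t_split < t_fused else "keep"
    else:                                 # priority == "energy"
        verdict = "split" if e_split < e_fused else "keep"

    chosen_energy = e_split if verdict == "split" else e_fused
    if chosen_energy > energy_cap:
        verdict = "re-adjust"             # exceeds the energy ceiling: choose again
    return verdict
```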
Several concrete implementations of this solution are described below; practical implementations are not limited to these. To better illustrate how the method optimizes efficiently, some concepts and principles of the related art are explained first.
Example 1: this example briefly explains how the method reduces model inference time. A simple CNN structure is shown on the left of fig. 7, and the result of optimizing it according to the invention on the right. In the original structure, after the CONV computation the data must be stored to memory, and while preparing the BN it must be carried from memory back into the on-chip cache; once BN is computed, the data must be carried twice more before the RELU layer can start. Under operator fusion, the optimized graph keeps the CONV output in the on-chip cache and performs BN directly; after BN, the data again stays in the cache and RELU is performed directly. Compared with the original structure, the four rounds of carrying data between operators through memory are eliminated, saving a large amount of inference time.
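The effect can be imitated in a few lines of NumPy: the unfused version materializes every intermediate array (each standing in for a main-memory round trip), while the fused version applies the BN-like scale/shift and the ReLU to each tile while it is still "on chip". The stand-in kernels are deliberately trivial; this only illustrates the memory-traffic argument, not real CONV/BN code.

```python
import numpy as np

def unfused(x, scale, shift):
    conv_out = x * 2.0                     # stand-in for CONV; written back to memory
    bn_out = conv_out * scale + shift      # re-read, BN-like scale/shift, written back
    return np.maximum(bn_out, 0.0)         # re-read once more for RELU

def fused(x, scale, shift, tile=1024):
    out = np.empty_like(x)
    for i in range(0, x.size, tile):       # one pass; intermediates never leave the tile
        t = x.flat[i:i + tile] * 2.0
        t = t * scale + shift
        out.flat[i:i + tile] = np.maximum(t, 0.0)
    return out

x = np.random.rand(4096).astype(np.float32)
assert np.allclose(unfused(x, 0.5, 0.1), fused(x, 0.5, 0.1))
```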
Example 2: this example briefly explains why the method is strongly tied to the hardware. The left of fig. 8 shows two operators of a neural network before fusion; the middle shows their structure after fusion on hardware A; the right shows their structure after fusion on hardware B. For the different hardware A and B, the method builds different cost models and outputs different computation graphs according to the operator running times those models predict. For hardware A, cost model a is built; it predicts the running time of CONV1 as t1, of CONV2 as t2, and of the fused composite operator as t3. Since t3 < t1 + t2, the operators are fused and the computation graph in the middle of fig. 8 is output. For hardware B, cost model b is built; it predicts the running time of CONV1 as t1', of CONV2 as t2', and of the fused composite operator as t3'. Since t3' > t1' + t2', the operator is split and the computation graph on the right of fig. 8 is output. Using the same operator fusion strategy for every hardware platform keeps optimization cost low, but such a strategy clearly cannot fit every device and loses some performance on part of the hardware. This method dynamically adjusts the output computation graph for different hardware, so that the power consumption of the model running on all hardware is minimized.
Example 3: this example briefly describes how the method fuses operators in a complex model. Part of the GoogLeNet structure is shown on the left of fig. 9; the right shows the structure after static operator fusion with this method. The computation graph is traversed in reverse order; the Concat operator takes a non-fusible operator as input, so it is not fused. Conv1x1 and Conv3x3 satisfy the basic fusion criteria but not a stable fusion rule, so they are fused into a Conv3-Conv1 operator whose profitability will be re-evaluated later. Conv1x1 and Conv5x5 satisfy the basic criteria and are fused into a Conv5-Conv1 operator. Conv1x1 and MaxPool3x3 satisfy the basic criteria but not a stable rule and are fused into a MaxPool3-Conv1 operator. A different Conv1x1/MaxPool3x3 pair does not satisfy the basic fusion criteria, so it is not fused. Conv3x3 and Relu satisfy a stable fusion rule and are fused into a Conv3-Relu operator that will not be split later. The fusion of the remaining operators is similar.
Example 4: this example briefly describes how the method processes a computation graph in parallel. Fig. 10 shows the part of the statically fused computation graph that undergoes inter-operator parallel processing; the other parts receive only intra-operator parallel processing. The operators in the upper box have data dependences among them, so they are processed serially. The operators in the lower box have no data dependence and can be processed in parallel. The degree of parallelism is decided by the abstracted virtual computing units: if the hardware has four computing units that can perform convolutions simultaneously, the four operators in the lower box can run fully in parallel, and the final running time depends only on the slowest of them; if the hardware has only one computing unit capable of convolution, the operators in the box are fully serialized, and the final running time is the sum of the four operators' running times.
Example 5: this example briefly describes operator splitting on a CNN structure. Fused operators that do not match a stable fusion rule are re-evaluated with the cost model, which predicts the inference time or energy consumption (chosen according to actual need) before and after fusion; operators whose cost increases after fusion are taken apart again. As shown in fig. 11, after cost-model evaluation one fusion is found to reduce both inference time and energy consumption, so it is kept. For the CONV1-CONV5 fusion operator, the evaluation shows that inference time and energy consumption increase after fusion, so the two operators are taken apart and return to the unfused state.
Compared with other computation-graph optimization methods, this automatic optimization method for a depth model calculation graph related to hardware characteristics produces an optimal computation graph for each kind of hardware, addressing the previously low coupling between the graph layer and the hardware; it saves a large amount of manual optimization by letting the computer automatically generate candidate optimal subgraphs and evaluate their performance as actually deployed. The invention considers and optimizes both power consumption and time during inference; by fusing first and splitting afterwards it combines dynamic and static optimization, reducing optimization cost while improving the result. With this invention, a deep learning compiler can automatically generate optimal code for any device after a single development effort: operators developed for a CPU, for example, can be used almost unchanged on a GPU or a D-chip, significantly reducing cost. The method can be applied to embedded inference scenarios and supports inference acceleration for most CNN networks. It is compatible with mainstream deep learning compilers and mainly targets hardware such as FPGAs and GPUs.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method for automatically optimizing a depth model calculation graph related to hardware characteristics, characterized by comprising the following steps:
S1, obtaining a deep learning model and converting it into an intermediate representation according to preset rules, the intermediate representation describing the structure of the computation graph;
S2, performing a first operator fusion on the intermediate representation according to static operator fusion rules to form composite subgraphs;
S3, parallelizing the operators in the composite subgraphs according to the hardware resources, and performing a second operator fusion to form aggregated subgraphs;
S4, performing cross-boundary optimization on the aggregated subgraphs;
S5, building a list of original operators and fused operators from the records of the first and second operator fusions; constructing a cost model on the target hardware platform based on that list, and evaluating the fusion effect;
S6, splitting fusion operators according to the evaluation of the fusion effect, and using the resulting subgraphs as the input for operator-level code generation.
2. The method according to claim 1, wherein the intermediate representation defines the computation type, the number of inputs, and the number of outputs of each operator in the computation graph, together with the parameters relevant to the computation; and the intermediate representation also records the number of component operators of each fusion operator and their associated information.
3. The method according to claim 1, wherein S2 specifically includes: according to the converted intermediate representation, traversing the computation graph in reverse order from the outputs and fusing the operators that satisfy the static operator fusion rules; iterating this process until no more operators in the graph can be fused; and re-integrating the fused computation graph into composite subgraphs.
4. The method according to claim 1, wherein the static operator fusion rules include:
data dependence exists between the operators to be fused;
fusion must not create directed cycles between subgraphs;
the on-chip cache is sufficient after fusion, so that no data needs to be moved from main memory during the computation.
5. The method according to claim 1, wherein in S3 the parallelization includes: abstracting the underlying hardware resources into virtual computing units and scheduling the pending computations onto those units, so that the operators are processed in parallel.
6. The method according to claim 1, wherein the parallelization comprises intra-operator parallelism and inter-operator parallelism;
the intra-operator parallelism includes: organizing the parallel logic among the operators being fused, and, where hardware resources permit, running in parallel those computations inside the fusion operator that have no data dependence;
the inter-operator parallelism includes: after the operators have been fused, running the fusion operators themselves in parallel; when fusion operators run in parallel, the intra-operator parallelism is readjusted at the same time.
7. The method according to claim 6, wherein if spare resources can realize inter-operator parallelism at the price of a lower degree of intra-operator parallelism, the inter-operator parallel strategy is given priority.
8. The method according to claim 1, wherein S4 specifically includes:
traversing the computation graph by topological sorting: first selecting a node with in-degree 0 and executing it, then deleting that node and the edges connected to it, and then finding the next node with in-degree 0 to execute;
performing cross-boundary optimization during the traversal, the cross-boundary optimization including constant folding, inline optimization, and common sub-expression extraction.
9. The method according to claim 1, wherein in S5 the cost model is trained as follows:
traversing the computation graph and extracting the features of all operators present at that point, including the number and types of the operators composing each fusion operator;
selecting a preset number of operators, deploying them on the hardware, and measuring their time and energy consumption as labeled data;
building a model from the labeled data according to a semi-supervised algorithm and predicting the unlabeled data;
adding high-confidence predictions to the labeled data set and retraining the model with them as labels;
when the labeled data set exceeds a preset proportion of the data, ending training to obtain the trained cost model.
10. The method according to claim 1, wherein in S6, if the fusion effect does not meet a preset condition, the fusion operator is split.
CN202211175991.8A 2022-09-26 2022-09-26 Automatic optimization method for depth model calculation graph related to hardware characteristics Pending CN115423082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211175991.8A CN115423082A (en) 2022-09-26 2022-09-26 Automatic optimization method for depth model calculation graph related to hardware characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211175991.8A CN115423082A (en) 2022-09-26 2022-09-26 Automatic optimization method for depth model calculation graph related to hardware characteristics

Publications (1)

Publication Number | Publication Date
CN115423082A | 2022-12-02

Family

ID=84207119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211175991.8A Pending CN115423082A (en) 2022-09-26 2022-09-26 Automatic optimization method for depth model calculation graph related to hardware characteristics

Country Status (1)

Country Link
CN (1) CN115423082A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116665020A (en) * | 2023-07-31 | 2023-08-29 | State Grid Zhejiang Electric Power Co., Ltd. | Image recognition method, device, equipment and storage medium based on operator fusion
CN116665020B (en) * | 2023-07-31 | 2024-04-12 | State Grid Zhejiang Electric Power Co., Ltd. | Image recognition method, device, equipment and storage medium based on operator fusion
CN117009092A (en) * | 2023-10-07 | 2023-11-07 | Zhejiang Lab | Method and system for dynamically allocating compile-time resources based on multiple multi-armed bandits
CN117009092B (en) * | 2023-10-07 | 2024-02-02 | Zhejiang Lab | Method and system for dynamically allocating compile-time resources based on multiple multi-armed bandits


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination