CN113703741A

CN113703741A - Neural network compiler configuration method and device, computer equipment and storage medium

Info

Publication number: CN113703741A
Application number: CN202111267825.6A
Authority: CN
Inventors: 白杨; 余蓓; 沈小勇; 吕江波; 贾佳亚
Original assignee: Shenzhen Smartmore Technology Co Ltd; Shanghai Smartmore Technology Co Ltd
Current assignee: Shenzhen Smartmore Technology Co Ltd; Shanghai Smartmore Technology Co Ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2021-11-26
Anticipated expiration: 2041-10-29
Also published as: CN113703741B

Abstract

The application relates to a neural network compiler configuration method, a neural network compiler configuration device, a computer device and a storage medium. The method comprises the following steps: acquiring a target neural network model and an initial calculation chart corresponding to the target neural network model; the initial computation graph comprises a plurality of operators; dividing a plurality of operators to obtain a plurality of operator sets; determining various operator type combinations respectively corresponding to the operator sets, and acquiring operator operation time corresponding to the operator type combinations to obtain a plurality of operator operation times respectively corresponding to the operator sets; taking the operator type combination with the minimum operator operation time in the various operator type combinations as a target operator type combination corresponding to each operator set; and generating a target calculation graph of the target neural network model according to the target operator type combination corresponding to each operator set, and generating compiler configuration information aiming at the target neural network model according to the target calculation graph. By adopting the method, the configuration optimization effect of the neural network model compiler can be improved.

Description

Neural network compiler configuration method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a neural network compiler configuration method, an apparatus, a computer device, and a storage medium.

Background

With the development of computer technology, the application of neural network models has become more and more extensive, but the neural network models still face huge challenges in terms of operational performance. At present, in order to realize efficient operation of a neural network model, the operation performance of the neural network model is generally improved by performing optimal configuration on a compiler of the neural network model.

In the conventional technology, optimizing configuration of a compiler of a neural network model is usually based on a handwriting rule of an expert, and a calculation graph of the neural network model is split into a plurality of sub-graphs according to a splitting rule of expert experience, and then the sub-graphs are respectively optimized to obtain an intermediate representation of the compiler so as to generate a compiler code. However, in the current neural network compiler configuration mode, the optimization of the computation graph needs to be performed depending on the handwriting rule of an expert, so that the compiler configuration optimization effect on the neural network model of an emerging operator or a model structure is poor.

Disclosure of Invention

In view of the above, it is necessary to provide a neural network compiler configuration method, apparatus, computer device and storage medium for solving the above technical problems.

A neural network compiler configuration method, the method comprising:

acquiring a target neural network model to be configured by a compiler and an initial calculation chart corresponding to the target neural network model; the initial computation graph comprises a plurality of operators;

dividing the operators to obtain a plurality of operator sets;

determining various operator type combinations respectively corresponding to each operator set, and acquiring operator running time corresponding to each operator type combination to obtain a plurality of operator running times respectively corresponding to each operator set;

taking the operator type combination with the minimum operator running time in the plurality of operator type combinations as a target operator type combination corresponding to each operator set respectively;

and generating a target calculation graph of the target neural network model according to the target operator type combination corresponding to each operator set, and generating compiler configuration information aiming at the target neural network model according to the target calculation graph.

In one embodiment, the obtaining the operator runtime corresponding to each operator type combination includes: determining a current operator type combination, and acquiring the operator types of operators corresponding to the current operator type combination; determining operator fusion relation information corresponding to the current operator type combination according to the operator type of each operator; and acquiring the operator running time corresponding to the current operator type combination based on the operator fusion relation information.

In one embodiment, the obtaining, based on the operator fusion relationship information, an operator running time corresponding to the current operator type combination includes: determining fusion operators contained in the current operator type combination based on the operator fusion relation information; acquiring operator running time of the fusion operator; and obtaining the operator running time corresponding to the current operator type combination according to the operator running time of the fusion operator and the operator running time of the non-fusion operator contained in the preset current operator type combination.

In one embodiment, the generating compiler configuration information for the target neural network model according to the target computation graph includes: splitting the target calculation graph to obtain a plurality of sub-calculation graphs corresponding to the target calculation graph; acquiring a plurality of candidate configuration information corresponding to each sub-calculation graph to form a candidate configuration information set, and determining target configuration information corresponding to each sub-calculation graph from the candidate configuration information set; and obtaining compiler configuration information of the target neural network model according to the target configuration information corresponding to each sub-computation graph.

In one embodiment, the obtaining of the multiple candidate configuration information corresponding to each sub-computation graph includes: acquiring a configuration information structure template set for a current sub-computation graph and a plurality of groups of configuration information parameters randomly generated for the configuration information structure template; and obtaining a plurality of candidate configuration information corresponding to the current sub-computation graph according to the configuration information structure template and the plurality of groups of configuration information parameters.

In one embodiment, the determining, from the candidate configuration information set, target configuration information corresponding to each of the sub-computation graphs includes: determining candidate configuration information with the minimum operation time from the candidate configuration information set; and taking the candidate configuration information with the minimum operation time as the target configuration information of the current sub-computation graph.

In one embodiment, the determining the candidate configuration information with the minimum computation time from the candidate configuration information set includes: screening out a preset number of candidate configuration information from the candidate configuration information set to serve as test configuration information; acquiring the operation time corresponding to the test configuration information of the preset number respectively, and constructing an operation time loss function according to the operation time corresponding to the test configuration information of the preset number respectively; and predicting candidate configuration information in the candidate configuration information set by using the operation time loss function, and taking the candidate configuration information with the minimum operation time loss function as the candidate configuration information with the minimum operation time.

A neural network compiler configuration apparatus, the apparatus comprising:

the target network acquisition module is used for acquiring a target neural network model to be subjected to compiler configuration and an initial calculation graph corresponding to the target neural network model; the initial computation graph comprises a plurality of operators;

the operator set acquisition module is used for dividing the operators to obtain a plurality of operator sets;

the operator combination acquisition module is used for determining various operator type combinations corresponding to the operator sets respectively, and acquiring the operator operation time corresponding to the operator type combinations to obtain a plurality of operator operation times corresponding to the operator sets respectively;

a target combination determining module, configured to use, as an operator type combination corresponding to each operator set, an operator type combination with a minimum operator operation time in the multiple operator type combinations;

and the compiling information configuration module is used for generating a target calculation graph of the target neural network model according to the target operator type combination corresponding to each operator set, and generating compiler configuration information aiming at the target neural network model according to the target calculation graph.

A computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.

The neural network compiler configuration method, the neural network compiler configuration device, the computer equipment and the storage medium acquire a target neural network model to be subjected to compiler configuration and an initial calculation graph corresponding to the target neural network model; the initial computation graph comprises a plurality of operators; dividing a plurality of operators to obtain a plurality of operator sets; determining various operator type combinations respectively corresponding to the operator sets, and acquiring operator operation time corresponding to the operator type combinations to obtain a plurality of operator operation times respectively corresponding to the operator sets; taking the operator type combination with the minimum operator operation time in the various operator type combinations as a target operator type combination corresponding to each operator set respectively; and generating a target calculation graph of the target neural network model according to the target operator type combination corresponding to each operator set, and generating compiler configuration information aiming at the target neural network model according to the target calculation graph. The method comprises the steps of searching operator running time corresponding to a plurality of operator sets obtained after a target neural network model is split, generating a target calculation graph by utilizing a target operator type combination with the minimum operator running time, and then obtaining final compiler configuration information by utilizing the target calculation graph.

Drawings

FIG. 1 is a flow diagram illustrating a method for configuring a neural network compiler in accordance with an embodiment;

FIG. 2 is a schematic diagram illustrating a process for obtaining operator runtime for each combination of operator types in one embodiment;

FIG. 3 is a schematic flow chart illustrating the process of obtaining operator runtime based on fusion relationship information according to an embodiment;

FIG. 4 is a schematic flow diagram illustrating the generation of compiler configuration information for a target neural network in one embodiment;

FIG. 5 is a block diagram of a neural network compiler design in an example application;

FIG. 6 is a diagram illustrating node set partitioning in an example application;

FIG. 7 is a diagram illustrating a process of fusing dynamic programming algorithms in an application example;

FIG. 8 is an exemplary diagram of an automatic contour generation technique in an application example;

FIG. 9 is a block diagram of an apparatus for neural network compiler configuration according to an embodiment;

FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in this application are information and data authorized by the user or sufficiently authorized by each party; correspondingly, the application also provides a corresponding user authorization entrance for the user to select authorization or to select denial.

In one embodiment, as shown in fig. 1, a neural network compiler configuration method is provided, and this embodiment is illustrated by applying this method to a terminal, it is to be understood that this method may also be applied to a server, and may also be applied to a system including a terminal and a server, and is implemented by interaction between the terminal and the server. In this embodiment, the method includes the steps of:

step S101, acquiring a target neural network model to be configured by a compiler and an initial calculation chart corresponding to the target neural network model; the initial computation graph includes a plurality of operators.

The target neural network model refers to a neural network model that needs to be configured by a compiler, and may be, for example, an ViT model for performing an image classification task, or a DETR model for performing a target detection task, and an SETR model for an image segmentation task, and the initial computation graph refers to a computation graph abstracted from the target neural network model, where the computation graph includes a plurality of nodes, and is respectively used for representing a plurality of operators of the target neural network model. Specifically, the terminal may abstract the target neural network model that needs to be optimally configured by the compiler into a calculation graph form, so as to obtain an initial calculation graph corresponding to the target neural network model.

Step S102, a plurality of operators are divided to obtain a plurality of operator sets.

The operator set refers to a set formed by combining one or more divided operators after the terminal divides the plurality of operators included in the initial calculation graph, and the division of the operators is divided according to the calculation intensity based on the operators, so that the plurality of operator sets are obtained. For example, a certain computation graph includes 10 operator nodes, and the terminal may divide the 10 operator nodes into an operator set a composed of three operators, an operator set B composed of another three operators, and an operator set C composed of the remaining four operators, thereby obtaining a plurality of operator sets.

Step S103, determining various operator type combinations corresponding to the operator sets respectively, and acquiring operator operation time corresponding to the operator type combinations to obtain various operator operation time corresponding to the operator sets respectively.

The operator type combination refers to a combination of operator types corresponding to operators included in each operator set, and may refer to a combination of calculation types of multiple operators. In this embodiment, each operator may be assigned with multiple operator types, and therefore, the combination thereof may also be multiple, and if a certain operator set includes 3 operators, namely operator a, operator B, and operator C, and each operator corresponds to 5 operator types, for example, it may be an objective type, a reduction type, a component-out-usable type, an element-wise type, and an opaque type, respectively, then the operator set may have 5 × 5, i.e., 125 operator type combinations. And then, the terminal can obtain the operator running time corresponding to each operator type combination, and can obtain the operator running time of each operator type combination by performing GPU test on different operator running combinations, so as to obtain a plurality of operator running times corresponding to each operator set.

And step S104, taking the operator type combination with the minimum operator running time in the various operator type combinations as the target operator type combination corresponding to each operator set.

The target operator type combination refers to the operator type combination with the minimum operator operation time in the multiple operator type combinations corresponding to each operator set, and after the terminal obtains the multiple operator type combinations corresponding to a certain operator set and the operator operation time corresponding to each operator type combination, the operator type combination with the minimum operator operation time can be selected from the operator type combinations to serve as the target operator type combination corresponding to the operator set, so that the terminal can obtain the target operator type combination corresponding to each operator set.

And step S105, generating a target calculation graph of the target neural network model according to the target operator type combination corresponding to each operator set, and generating compiler configuration information aiming at the target neural network model according to the target calculation graph.

The target calculation graph is a calculation graph generated based on the target operator type combination again after the terminal determines the target operator type combination corresponding to each operator set, and the compiler configuration information refers to compiler configuration information generated according to the target calculation graph of the target neural network model, and may be configuration codes of a compiler of the target neural network model. Specifically, after the terminal obtains a target operator type combination corresponding to each operator set of the target neural network model, a calculation graph for the target neural network model can be regenerated according to the target operator type combination to serve as a target calculation graph of the target neural network model. Meanwhile, the terminal can further generate a corresponding compiler configuration code by using the obtained target calculation graph, and the compiler configuration code is used as compiler configuration information of the target neural network model.

In the neural network compiler configuration method, a target neural network model to be subjected to compiler configuration and an initial calculation graph corresponding to the target neural network model are obtained; the initial computation graph comprises a plurality of operators; dividing a plurality of operators to obtain a plurality of operator sets; determining various operator type combinations respectively corresponding to the operator sets, and acquiring operator operation time corresponding to the operator type combinations to obtain a plurality of operator operation times respectively corresponding to the operator sets; taking the operator type combination with the minimum operator operation time in the various operator type combinations as a target operator type combination corresponding to each operator set respectively; and generating a target calculation graph of the target neural network model according to the target operator type combination corresponding to each operator set, and generating compiler configuration information aiming at the target neural network model according to the target calculation graph. The method comprises the steps of searching operator running time corresponding to a plurality of operator sets obtained after a target neural network model is split, generating a target calculation graph by utilizing a target operator type combination with the minimum operator running time, and then obtaining final compiler configuration information by utilizing the target calculation graph.

In one embodiment, as shown in fig. 2, step S103 may further include:

step S201, determining a current operator type combination, and obtaining the operator types of the operators corresponding to the current operator type combination.

In this embodiment, the terminal may select any one of a plurality of operator type combinations corresponding to a certain operator set as the current operator type combination, and determine the operator type of each operator under the current operator type combination.

Step S202, determining operator fusion relation information corresponding to the current operator type combination according to the operator type of each operator.

The operator fusion relation information refers to fusion relation information between every two adjacent operators in an operator set under the current operator type combination, is determined by the operator types of the operators contained in the current operator type combination, and is expressed in a matrix form. For example, for an operator set in which three operators exist, if operator a and operator B are adjacent, operator B and operator C are adjacent, and operator a can be fused with operator B, and operator B cannot be fused with operator C, then its fusion relationship information may be [1,1,0], and if operator a cannot be fused with operator B, operator B can be fused with operator C, then its fusion relationship information may be [0,1,1], and so on, that is, fusion relationship information between operators is characterized by setting the elements of two adjacent operators in which a fusion relationship exists to 1.

Meanwhile, the fusion relationship between operators may also be determined according to the operator type of each operator, and the terminal may be preset with an operator type capable of implementing operator fusion, for example, it may be set that when both operators belong to a complex-out-fusable type, the two operators may be fused, and if one operator belongs to the complex-out-fusable type and the other operator belongs to a reduction type, the two operators may not be fused, in this embodiment, the terminal may determine the operator fusion relationship information corresponding to the current operator type combination by determining the current operator type combination, for example, the operator a, the operator B, and the operator C belong to the complex-out-fusable type, and the reduction type, respectively, and since the operator a and the operator B may be fused, and the operator B and the operator C can not be fused, so the operator fusion relation information can be [1,1,0 ].

And step S203, acquiring operator running time corresponding to the current operator type combination based on the operator fusion relation information.

After the operator fusion information is determined, the operator running time corresponding to the current operator type combination can be obtained based on the obtained operator fusion information, for example, the operator capable of being fused can be determined according to the fusion relation information, the operator capable of being fused is fused and then tested by a GPU to obtain the corresponding running time, the operator incapable of being fused is tested by the GPU to obtain the corresponding running time, and then the running time of the fused operator and the running time of the operator incapable of being fused are summed to obtain the operator running time corresponding to the current operator type combination.

In this embodiment, corresponding operator fusion relationship information may also be obtained through the operator types of each operator in the current operator type combination, and the final operator operation time is determined as the operator operation time of the current operator type combination based on the fusion relationship information, so that accuracy of obtaining the operator operation time can be improved.

Further, as shown in fig. 3, step S203 may further include:

step S301, determining fusion operators contained in the current operator type combination based on operator fusion relation information.

The fusion operator refers to an operator which can be fused in operators included in a current operator type combination and then is generated, the fusion operator can be obtained according to operator fusion information, for example, the operator fusion information is [1,1,0], it is indicated that the operator a and the operator B can be fused, and the operator B and the operator C cannot be fused, the current operator type combination includes a fusion operator AB and an operator C which cannot be fused, that is, a non-fusion operator C, and if the operator fusion information is [0,1,1], it is indicated that the operator B and the operator C can be fused, but the operator a and the operator B cannot be fused, the current operator type combination includes the fusion operator BC and the operator a which cannot be fused, that is, the non-fusion operator a. Specifically, the terminal may fuse the adjacent operators having the fusion relationship based on the fusion relationship information, so as to determine the corresponding fusion operator.

Step S302, obtaining operator running time of a fusion operator;

step S303, obtaining the operator running time corresponding to the current operator type combination according to the operator running time of the fusion operator and the operator running time of the non-fusion operator contained in the preset current operator type combination.

Then, the terminal may determine the operator running time of the fusion operator included in the current operator type combination, for example, the operator running time of the fusion operator may be determined by performing a GPU test on the fusion operator, or the terminal may obtain the operator running times of all fusion operators included in the current operator type combination and the operator running times of all non-fusion operators except the fusion operator in the current operator type combination by obtaining the operator running times of all fusion operators included in the current operator type combination and the operator running times of all non-fusion operators except the fusion operator in the current operator type combination.

Taking operator fusion information as [1,1,0] as an example, the visible operator a and the operator B can be fused to form a fusion operator AB, and then the fusion operator running time corresponding to the current operator type combination can be the operator running time of the fusion operator AB plus the operator running time of the operator C.

In this embodiment, the operator running time corresponding to the current operator type combination can be obtained by calculating the operator running time of the fusion operator and the operator running time of the non-fusion operator, and the acquisition efficiency of the operator running time can be improved.

In one embodiment, as shown in fig. 4, step S105 may further include:

step S401, splitting the target calculation graph to obtain a plurality of sub-calculation graphs corresponding to the target calculation graph.

Specifically, after the terminal obtains the target calculation graph, the whole target calculation graph can be further divided into a plurality of sub-graph structures according to the calculation intensity of the operator, so that a plurality of sub-calculation graphs are obtained.

In step S402, a plurality of candidate configuration information corresponding to each of the sub-computation graphs are acquired, a candidate configuration information set is formed, and target configuration information corresponding to each of the sub-computation graphs is determined from the candidate configuration information set.

In this embodiment, the terminal may configure multiple types of configuration information for the same sub-computation graph as candidate configuration information, and the target configuration information refers to configuration information finally applicable to each sub-computation graph, where the configuration information may be obtained by selecting from multiple candidate configuration information. Specifically, the terminal may obtain, for each sub-computation graph, a plurality of pieces of configuration information for the sub-computation graph as candidate configuration information to form a candidate configuration information set, and then may select one of the candidate configuration information from the candidate configuration information set, for example, the candidate configuration information with the smallest computation amount or the shortest running time may be used as the target configuration information corresponding to each sub-computation graph.

And step S403, obtaining compiler configuration information of the target neural network model according to the target configuration information corresponding to each sub-computation graph.

In step S402, after the terminal obtains the target configuration information corresponding to each sub-computation graph, the terminal may generate the configuration information of the corresponding target computation graph according to the target configuration information, which is used as the compiler configuration information of the final target neural network model.

In this embodiment, after the target computation graph is obtained, the target computation graph may be further split to obtain a plurality of sub-computation graphs, and tuning of each sub-computation graph is implemented by screening target configuration information of each sub-computation graph, so that the optimization effect of compiler configuration may be further improved.

Further, step S402 may further include: acquiring a configuration information structure template set for a current sub-computation graph and a plurality of groups of configuration information parameters randomly generated for the configuration information structure template; and obtaining a plurality of candidate configuration information corresponding to the current sub-computation graph according to the configuration information structure template and the plurality of groups of configuration information parameters.

The current sub-computation graph refers to any one of a plurality of sub-computation graphs obtained after splitting a target computation graph, the configuration information structure template is an outline template configured for the current sub-computation graph in advance by a user, and may be a for-loop structure template, for example, the configuration information parameters refer to information parameters used for generating candidate configuration information, and may be detail optimization information of a for-loop structure, for example, the terminal may set multiple sets of detail optimization information for the for-loop structure template, and may generate assignments in a random assignment manner as multiple sets of configuration information parameters.

Specifically, a user may set a corresponding configuration information structure template for the current sub-computation graph through the terminal, and the terminal may further generate multiple sets of corresponding configuration information parameters through a random assignment manner, so as to obtain multiple pieces of configuration information as multiple candidate configuration information corresponding to the current sub-computation graph by using the configuration information structure template and the multiple sets of configuration information parameters generated randomly.

In this embodiment, enumeration of the candidate configuration information may also be implemented by combining a configuration information structure template and a plurality of randomly generated sets of configuration information parameters, that is, by using a contour automatic generation technology, so that diversity of the candidate configuration information may be further ensured.

In one embodiment, step S402 may further include: determining candidate configuration information with the minimum operation time from the candidate configuration information set; and taking the candidate configuration information with the minimum operation time as the target configuration information of the current sub-computation graph.

In this embodiment, after obtaining a plurality of candidate configuration information of the current sub-calculation graph, the terminal may place the plurality of candidate configuration information on the back end for testing, so as to obtain the operation test time corresponding to each candidate configuration information, and may further use the candidate configuration information with the minimum operation test time as the target configuration information of the current sub-calculation graph.

For example, the candidate configuration information for the current sub-computation graph may include: the configuration information A, the configuration information B and the configuration information C are generated by the same configuration information structure template and a plurality of randomly generated configuration information parameters, such as configuration information parameter A, configuration information parameter B and configuration information parameter C, at this time, the terminal can respectively put the configuration information A, the configuration information B and the configuration information C into a back end for testing, so that a plurality of running test times, namely a plurality of operation times, can be obtained, and then the terminal can use candidate configuration information with the minimum operation time, which can be the configuration information B as target configuration information of a current sub-computation graph.

In this embodiment, the target configuration information is selected by determining the candidate configuration information with the minimum operation time in the candidate configuration information set, so that it can be ensured that the configuration information generated during the configuration of the sub-computation graph is the configuration information with the minimum operation time, and the optimization performance of the compiler can be further improved.

Further, determining the candidate configuration information with the minimum operation time from the candidate configuration information set includes: screening out a preset number of candidate configuration information from the candidate configuration information set to serve as test configuration information; acquiring the operation time corresponding to the test configuration information of the preset number respectively, and constructing an operation time loss function according to the operation time corresponding to the test configuration information of the preset number respectively; and predicting the candidate configuration information in the candidate configuration information set by using the operation time loss function, and taking the candidate configuration information with the minimum operation time loss function as the candidate configuration information with the minimum operation time.

Since the number of candidate configuration information obtained in step S402 is large, and if it is necessary to measure the running time of each candidate configuration information in a back-end test manner, a large amount of computing resources and running time are inevitably wasted, so the present embodiment provides a test time prediction of candidate configuration information by constructing a prediction function. The test configuration information is candidate configuration information used for constructing the prediction function, and the operation time loss function may be a prediction function for predicting candidate configuration information with the minimum operation time.

Specifically, the terminal may randomly select a certain number, i.e. a preset number, of candidate configuration information from the candidate configuration information set as the test configuration information, and through a back-end test, the test configuration information is tested for the operation time, so that the operation time corresponding to each test configuration information can be obtained, and then the terminal can obtain the operation time corresponding to each test configuration information according to the test configuration information, training the operation time corresponding to the test configuration information, constructing a corresponding operation time loss function used as a prediction function, and the other candidate configuration information in the candidate configuration information set can be predicted through the operation time loss function, after a prediction result is obtained, and taking the candidate configuration information with the minimum operation time loss function as the candidate configuration information with the minimum operation time, namely the target configuration information of the current sub-calculation graph.

In this embodiment, the terminal may further predict candidate configuration information with the minimum computation time by using the constructed computation time loss function, and compared with a case where it is necessary to measure the running time by using a back-end test for each candidate configuration information, this embodiment may save the computation resources and reduce the running time.

In an application example, a method for designing a neural network compiler based on graph and tensor joint optimization is further provided, and a technical framework of the method can be as shown in fig. 5, and specifically may include the following stages:

(1) the first stage is as follows:

firstly, training a model by using a mainstream deep learning framework to obtain a trained neural network model, for example, using ViT model for an image classification task, using a DETR model for a target detection task, and using SETR model for an image segmentation network. Because the model is a deep learning model formed by a convolutional neural network and a transformer together, and can complete the task of end-to-end image recognition.

(2) And a second stage:

secondly, by firstly assigning different calculation types to each operator in the calculation graph, in the optimization process, five calculation types exist, wherein the first calculation type is called injective, the second calculation type is called reduction, the third calculation type is called complex-out-fusable, the fourth calculation type is called element-wise, and the fifth calculation type is called opaque. Before proceeding with our algorithm, all the operator calculation types are defaulted to opaque types. Firstly, defining strategies and scheduling schemes for operator fusion as S = { (V1, F1), (V2, F2),. and (Vk, Fk) }, wherein Vi represents all operators contained in the ith stage, Fi represents the fusion relationship between any two operators, the maximum stage number is the number of nodes contained in the computational graph, and the fusion relationship is only two cases, fusion or non-fusion. Finally, the scheduling order for the entire computation graph is { (V1, F1), (V2, F2),., (Vk, Fk) }. With the scheduling strategy, an optimization objective function for a specific computation graph is obtained, and given a computation graph G and a fusion strategy S, the optimization objective is to find a schedule so that the cost of executing the computation graph on the GPU is minimum. Next, we describe the dynamic programming algorithm in detail. When we have the maximum number of stages of computational graph execution order and scheduling, we can partition the initial set of nodes in the computational graph into two disjoint subsets V-V 'and V'. And, the nodes in both subsets V-V 'and V' are such that the points in V-V 'are pointing to V', the specific relationship is shown in FIG. 6.

With respect to the dynamic programming algorithm, we define dp [ V ]]Representing the execution time of the node set V under the optimal scheduling, temp [ V']Then (V', F) is indicated. Then, the final state transition equation dp [ V ] can be obtained]=min(dp[V-V’]+Σ_vtemp[V’]) For the initial boundary condition dp [ ∅ ] of the dynamic programming method]And = 0. Aiming at the operators of the five calculation types, a designed dynamic programming algorithm is used for optimizing a sub-graph in the neural network, and the optimal result of the sub-graph is recorded, so that the recurrence of the whole graph is completed step by step. And storing the operator type configuration scheme with the optimal result to obtain a final fused calculation graph, so that the operator fusion scheme which enables the operation time delay of the calculation graph to be shortest can be obtained. FIG. 7 is a schematic diagram of a process of fusing two batch-based matrix multiplications and the softmax operator by a designed dynamic programming algorithm.

(3) And a third stage:

and then, the calculation graph fused in the first stage is sent to the back end of a compiler for optimization, the compiler divides the whole calculation graph into a plurality of subgraph structures according to the calculation intensity of operators, then each subgraph is optimized respectively, the subgraph in each stage is optimized without depending on an inherent scheduling template, and all possible configurations are enumerated through a contour automatic generation technology, so that a search space which can contain most calculation configurations is constructed. For each subgraph, a very good performance optimization is done and no tuning templates are written manually in advance. Automatic contour generation involves two techniques, the first is to construct the for-loop structure for all compute-intensive operators in the neural network, and does not generate any optimization details for the for-loop, only one general contour structure. The second technique, called stochastic labeling, is to perform stochastic assignment on the detail optimization information in the for-loop structure generated in the first step, so as to generate a plurality of candidate optimization schemes, where the candidate optimization schemes are sent to the cost model of the next stage for optimization, and an optimal configuration is selected, and fig. 8 explains an example of the automatic contour generation technique by fusing the matrix multiplication of batch and softmax operator.

(4) A fourth stage:

a large number of configurations are created in the third stage, it is not possible to test them all on the back-end and then take out the one with the best performance. As such methods can waste significant computing resources and runtime. The compiler adopts the scheme that part of the configuration searched out in the third stage is transmitted to the final hardware rear end for testing, the final test result is taken out and trained, and a loss function is constructed, wherein the loss function is a GBDT-based model quadratic loss function. And then, predicting the newly searched configuration in the third stage through a loss function, and selecting the configuration schemes with smaller loss function errors as possible target configurations, thereby generating a final target code.

In the application example, the optimization of the designed compiler at the graph level is based on a dynamic programming algorithm, rather than completely depending on the rules of expert manual design, so that the compiler can also have good tuning results for some emerging operators and model structures. Rather than merely limiting the optimization space to manual experience, it provides the possibility to explore a larger search space. Meanwhile, a proprietary fusion rule and a contour generation technology are designed for an operator fusion mode explored by a dynamic programming algorithm and a specific model structure in a transform model, such as the fusion of batch matrix multiplication and a softmax operator, so that the back end of a compiler optimizes the sub-graph model in a larger search space.

It should be understood that although the various steps in the flow charts of fig. 1-4 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least some of the steps in fig. 1-4 may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed in turn or alternately with other steps or at least some of the other steps.

In one embodiment, as shown in fig. 9, there is provided a neural network compiler configuration apparatus including: a target network obtaining module 901, an operator set obtaining module 902, an operator combination obtaining module 903, a target combination determining module 904, and a compiling information configuring module 905, wherein:

a target network obtaining module 901, configured to obtain a target neural network model to be configured by a compiler and an initial calculation graph corresponding to the target neural network model; the initial computation graph comprises a plurality of operators;

an operator set obtaining module 902, configured to divide multiple operators to obtain multiple operator sets;

an operator combination obtaining module 903, configured to determine multiple operator type combinations corresponding to each operator set, and obtain operator operation times corresponding to each operator type combination, so as to obtain multiple operator operation times corresponding to each operator set;

a target combination determining module 904, configured to use, as a target operator type combination corresponding to each operator set, an operator type combination with a minimum operator operation time among the multiple operator type combinations;

and the compiling information configuration module 905 is configured to generate a target calculation graph of the target neural network model according to the target operator type combination corresponding to each operator set, and generate compiler configuration information for the target neural network model according to the target calculation graph.

In an embodiment, the operator combination obtaining module 903 is further configured to determine a current operator type combination, and obtain an operator type of each operator corresponding to the current operator type combination; determining operator fusion relation information corresponding to the current operator type combination according to the operator type of each operator; and acquiring the operator running time corresponding to the current operator type combination based on the operator fusion relation information.

In an embodiment, the operator combination obtaining module 903 is further configured to determine a fusion operator included in the current operator type combination based on the operator fusion relationship information; acquiring operator running time of the fusion operator; and obtaining the operator running time corresponding to the current operator type combination according to the operator running time of the fusion operator and the operator running time of the non-fusion operator contained in the preset current operator type combination.

In an embodiment, the compiling information configuring module 905 is further configured to split the target computation graph to obtain a plurality of sub-computation graphs corresponding to the target computation graph; acquiring a plurality of candidate configuration information corresponding to each sub-calculation graph to form a candidate configuration information set, and determining target configuration information corresponding to each sub-calculation graph from the candidate configuration information set; and obtaining compiler configuration information of the target neural network model according to the target configuration information corresponding to each sub-computation graph.

In one embodiment, the compiling information configuring module 905 is further configured to obtain a configuration information structure template set for the current sub-computation graph and multiple sets of configuration information parameters randomly generated for the configuration information structure template; and obtaining a plurality of candidate configuration information corresponding to the current sub-computation graph according to the configuration information structure template and the plurality of groups of configuration information parameters.

In an embodiment, the compiling information configuring module 905 is further configured to determine candidate configuration information with the minimum operation time from the candidate configuration information set; and taking the candidate configuration information with the minimum operation time as the target configuration information of the current sub-computation graph.

In an embodiment, the compiling information configuring module 905 is further configured to filter out a preset number of candidate configuration information from the candidate configuration information set as test configuration information; acquiring the operation time corresponding to the test configuration information of the preset number respectively, and constructing an operation time loss function according to the operation time corresponding to the test configuration information of the preset number respectively; and predicting the candidate configuration information in the candidate configuration information set by using the operation time loss function, and taking the candidate configuration information with the minimum operation time loss function as the candidate configuration information with the minimum operation time.

For specific limitations of the neural network compiler configuration apparatus, reference may be made to the above limitations of the neural network compiler configuration method, which will not be described herein again. The modules in the neural network compiler configuration apparatus may be wholly or partially implemented by software, hardware, or a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a neural network compiler configuration method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A neural network compiler configuration method, the method comprising:

dividing the operators to obtain a plurality of operator sets;

2. The method of claim 1, wherein obtaining the operator runtime for each operator type combination comprises:

determining a current operator type combination, and acquiring the operator types of operators corresponding to the current operator type combination;

determining operator fusion relation information corresponding to the current operator type combination according to the operator type of each operator;

and acquiring the operator running time corresponding to the current operator type combination based on the operator fusion relation information.

3. The method according to claim 2, wherein the obtaining operator running time corresponding to the current operator type combination based on the operator fusion relationship information includes:

determining fusion operators contained in the current operator type combination based on the operator fusion relation information;

acquiring operator running time of the fusion operator;

and obtaining the operator running time corresponding to the current operator type combination according to the operator running time of the fusion operator and the operator running time of the non-fusion operator contained in the preset current operator type combination.

4. The method of claim 1, wherein generating compiler configuration information for the target neural network model from the target computational graph comprises:

splitting the target calculation graph to obtain a plurality of sub-calculation graphs corresponding to the target calculation graph;

acquiring a plurality of candidate configuration information corresponding to each sub-calculation graph to form a candidate configuration information set, and determining target configuration information corresponding to each sub-calculation graph from the candidate configuration information set;

and obtaining compiler configuration information of the target neural network model according to the target configuration information corresponding to each sub-computation graph.

5. The method according to claim 4, wherein the obtaining of the plurality of candidate configuration information corresponding to each sub-computation graph comprises:

acquiring a configuration information structure template set for a current sub-computation graph and a plurality of groups of configuration information parameters randomly generated for the configuration information structure template;

and obtaining a plurality of candidate configuration information corresponding to the current sub-computation graph according to the configuration information structure template and the plurality of groups of configuration information parameters.

6. The method of claim 5, wherein determining the target configuration information corresponding to each of the sub-computation graphs from the set of candidate configuration information comprises:

determining candidate configuration information with the minimum operation time from the candidate configuration information set;

and taking the candidate configuration information with the minimum operation time as the target configuration information of the current sub-computation graph.

7. The method of claim 6, wherein the determining the candidate configuration information with the smallest computation time from the candidate configuration information set comprises:

screening out a preset number of candidate configuration information from the candidate configuration information set to serve as test configuration information;

acquiring the operation time corresponding to the test configuration information of the preset number respectively, and constructing an operation time loss function according to the operation time corresponding to the test configuration information of the preset number respectively;

and predicting candidate configuration information in the candidate configuration information set by using the operation time loss function, and taking the candidate configuration information with the minimum operation time loss function as the candidate configuration information with the minimum operation time.

8. An apparatus for neural network compiler configuration, the apparatus comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.