WO2023150912A1

WO2023150912A1 - Operator scheduling operation time comparison method and device, and storage medium

Info

Publication number: WO2023150912A1
Application number: PCT/CN2022/075526
Authority: WO
Inventors: 胡以璇; 陈金林; 伍文龙
Original assignee: 华为技术有限公司
Priority date: 2022-02-08
Filing date: 2022-02-08
Publication date: 2023-08-17
Also published as: CN116897356A

Abstract

The present application relates to the field of data processing, and in particular to an operator scheduling operation time comparison method and device, and a storage medium. The method comprises: acquiring at least two candidate schedulings corresponding to a target computation expression, wherein the target computation expression is used for describing a computation logic of an operator; acquiring a cost comparison model, wherein the cost comparison model is a model obtained by training a neural network by using a plurality of sample schedulings; and according to the at least two candidate schedulings, invoking the cost comparison model to output a cost comparison result, wherein the cost comparison result is used for indicating a sorted order of magnitudes of execution durations of the at least two candidate schedulings on a target hardware platform. According to the present application, the relative magnitudes of the execution durations of different schedulings are directly compared without predicting absolute execution durations of the schedulings, thereby achieving an automatic tuning function of a compiler/automatic tuner, and greatly improving the speed and accuracy of the evaluation of scheduling operation cost.

Description

Operator scheduling running time comparison method, device and storage medium

Technical field

The present application relates to the field of data processing, and in particular to a method, device and storage medium for comparing scheduled running time of operators.

Background art

An operator indicates a data processing operation. For example, a neural network usually includes convolution operators and pooling operators: a convolution operator indicates a convolution operation, and a pooling operator indicates a pooling operation. To run an operator on an actual hardware platform and perform the corresponding data processing operation, executable code must be generated for the operator. The generation of an operator's executable code is divided into two steps: computation expression and scheduling. Computation expression means describing the computation logic of the operator in a specific language, that is, describing the task the operator must complete together with its inputs and outputs, and then converting that description into an intermediate language, which yields the operator's intermediate representation information (also called a template). Scheduling means optimizing the operator's intermediate representation information according to the hardware characteristics of the target hardware platform. The scheduling-optimized intermediate representation information can then be converted into executable code that the target hardware platform can recognize.

Automatic operator optimization is an important function of optimization tools and compilers. Its difficulty lies in searching, within a scheduling space formed by a massive number of schedules, for the optimal schedule implementation for a specific hardware platform. Evaluating the execution time of different schedules of the operators in a neural network on the hardware platform is therefore critical to the success of the optimization. To evaluate the execution time of a schedule on a specific hardware platform, the related art uses a pre-trained cost model to predict the absolute execution time of the schedule and thereby evaluate the scheduling running cost. However, with this method the error between the predicted absolute execution time and the real execution time is relatively large, and professionals must build a dedicated cost model for each specific hardware platform, which often requires a large amount of training data and a complex model structure. In addition, because of the relatively large prediction error, the uncertainty in comparing the costs of schedules with similar predicted values cannot be eliminated.

In related technologies, a reasonable and effective method for evaluating scheduling operation cost has not been provided.

Summary of the invention

In view of this, a method, device and storage medium for comparing the scheduled running time of operators are proposed. Without predicting the absolute execution time of any schedule, the embodiments of the present application directly compare the relative magnitudes of the execution times of different schedules, thereby realizing the automatic tuning function of a compiler/automatic optimizer and greatly improving the speed and accuracy of evaluating scheduling running cost.

In the first aspect, the embodiment of the present application provides a method for comparing the scheduled running time of operators, the method including:

Obtaining at least two candidate schedules corresponding to a target computation expression, where the target computation expression is used to describe the computation logic of an operator, and each candidate schedule is executable code of the operator on the target hardware platform, generated based on the target computation expression;

Obtaining a cost comparison model, where the cost comparison model is a model obtained by training a neural network using multiple sample scheduling;

According to the at least two candidate schedules, invoking the cost comparison model to output a cost comparison result, where the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules on the target hardware platform.

In this implementation, at least two candidate schedules corresponding to the target computation expression are obtained, and the cost comparison model is invoked on them to directly compare the relative magnitudes of their execution durations on the target hardware platform. The output is a cost comparison result indicating the order of those execution durations, which realizes the automatic tuning function of a compiler/automatic optimizer and greatly improves the speed and accuracy of evaluating scheduling running cost.
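The ordering step can be illustrated with a minimal sketch. It assumes a hypothetical callable `is_faster(a, b)` that stands in for the trained cost comparison model and returns True when schedule `a` is predicted to run faster than schedule `b`. The toy stand-in below reads a made-up duration table purely for demonstration; none of these names or values come from the application.

```python
from functools import cmp_to_key
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")

def rank_schedules(candidates: Sequence[S],
                   is_faster: Callable[[S, S], bool]) -> List[S]:
    """Order candidates from shortest to longest predicted execution duration,
    using only pairwise comparisons (no absolute duration is predicted)."""
    def cmp(a: S, b: S) -> int:
        if is_faster(a, b):
            return -1
        if is_faster(b, a):
            return 1
        return 0
    return sorted(candidates, key=cmp_to_key(cmp))

# Hypothetical stand-in for the model: a hidden table of made-up durations.
_toy_durations = {"sched_a": 3.0, "sched_b": 1.5, "sched_c": 2.2}

def toy_is_faster(a: str, b: str) -> bool:
    return _toy_durations[a] < _toy_durations[b]

ranking = rank_schedules(["sched_a", "sched_b", "sched_c"], toy_is_faster)
```

A learned comparator is not guaranteed to be transitive; in that case the sort still terminates, but the resulting order should be treated as approximate.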

In a possible implementation manner, according to the at least two candidate schedules, calling the cost comparison model to output the cost comparison result includes:

Preprocessing the at least two candidate schedules to obtain the preprocessed at least two candidate schedules;

inputting the preprocessed at least two candidate schedules into the cost comparison model, and outputting the cost comparison result;

Wherein, the cost comparison model is trained according to at least one set of sample data sets, and each set of sample data sets includes: at least two sample schedules corresponding to sample calculation expressions and pre-marked correct cost comparison results.

In this implementation, the at least two candidate schedules are preprocessed, the preprocessed candidate schedules are input into the cost comparison model, and the cost comparison result is output. Because the cost comparison model is trained on at least one sample data set, the accuracy of the model is ensured, and in turn so is the accuracy of the cost comparison result it outputs.

In another possible implementation manner, the preprocessing the at least two candidate schedules to obtain the preprocessed at least two candidate schedules includes:

For each of the candidate schedules in the at least two candidate schedules, performing feature extraction on the candidate schedules to obtain a feature matrix;

Perform normalization processing on the feature matrix corresponding to the candidate schedule to obtain the preprocessed candidate schedule.

In this implementation, for each of the at least two candidate schedules, feature extraction is performed to obtain a feature matrix, and the feature matrix is normalized to obtain the preprocessed candidate schedule. Preprocessing converts each candidate schedule into a dedicated data structure, which further ensures the accuracy of the cost comparison result subsequently output by the model.
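The normalization step can be sketched as follows. This passage does not specify the normalization function, so the signed logarithmic compression used here is an assumption chosen only because raw features such as loop extents can span several orders of magnitude; the function name and sample values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]

def normalize(features: Matrix) -> Matrix:
    """Compress large raw values into a small range, element by element.
    The log-based function is an assumption made for this sketch."""
    return [[math.copysign(math.log1p(abs(x)), x) for x in row]
            for row in features]

# Made-up raw feature matrix: two rows, three features each.
raw = [[0.0, 546756.0, 1.0], [16.0, 0.0, -3.0]]
normalized = normalize(raw)
```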

In another possible implementation manner, the feature matrix is used to indicate at least one of cycle (loop) information, input data shape information, calculation code, axis type code, and data access type code. The cycle information includes information related to the loop computation logic of the candidate schedule; the input data shape information describes the input data of the operator; the calculation code includes the code of the calculation instruction used in the current loop of the candidate schedule; the axis type code includes the type code of the operation performed on the axis; and the data access type code includes the type code of the data access.

In this implementation, the candidate schedule is converted into a feature matrix that may include at least one of five types of information: cycle information, input data shape information, calculation code, axis type code, and data access type code. Using a feature matrix with this data structure as the input of the cost comparison model further improves the accuracy of the cost comparison result subsequently output by the model.
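One possible layout of such a feature matrix is sketched below: one row per loop, with the five groups of information concatenated. The vocabularies, the `LoopInfo` type, the one-hot encoding and the column order are all hypothetical choices made for illustration; the application's actual data structure is the one shown in its Fig. 11.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical vocabularies for the three encoded groups.
COMPUTE_OPS = ("add", "mul", "fma")
AXIS_TYPES = ("spatial", "reduce")
ACCESS_TYPES = ("read", "write", "read_write")

def one_hot(value: str, vocabulary: Tuple[str, ...]) -> List[float]:
    return [1.0 if value == v else 0.0 for v in vocabulary]

@dataclass
class LoopInfo:
    extent: int        # loop (cycle) information: trip count of this loop
    compute_op: str    # calculation instruction used in this loop
    axis_type: str     # type of operation performed on the axis
    access_type: str   # type of data access

def feature_matrix(loops: List[LoopInfo],
                   input_shape: Tuple[int, ...]) -> List[List[float]]:
    """One row per loop: extent, input shape, then three one-hot groups."""
    shape = [float(d) for d in input_shape]
    return [[float(loop.extent)] + shape
            + one_hot(loop.compute_op, COMPUTE_OPS)
            + one_hot(loop.axis_type, AXIS_TYPES)
            + one_hot(loop.access_type, ACCESS_TYPES)
            for loop in loops]

matrix = feature_matrix(
    [LoopInfo(1932, "mul", "spatial", "read"),
     LoopInfo(283, "fma", "reduce", "read_write")],
    input_shape=(546756,),
)
```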

In another possible implementation manner, before the cost comparison model is acquired, the method further includes:

Obtain a training sample set, the training sample set includes at least one set of the sample data set;

For each sample data set, preprocessing the at least two sample schedules to obtain at least two preprocessed sample schedules;

inputting the preprocessed at least two sample schedules into an original parameter model to obtain a training result, and the original parameter model is a neural network model;

comparing the training result with the correct cost comparison result to obtain a calculation loss, the calculation loss being used to indicate an error between the training result and the correct cost comparison result;

According to the calculation losses corresponding to the at least one sample data set, training with an error back-propagation algorithm to obtain the cost comparison model.

In this implementation, before the cost comparison model is acquired, a training sample set including at least one sample data set is obtained. For each sample data set, the at least two sample schedules are preprocessed to obtain at least two preprocessed sample schedules; the preprocessed sample schedules are input into an original parameter model, which is a neural network model, to obtain a training result; and the training result is compared with the correct cost comparison result to obtain a calculation loss, which indicates the error between the two. According to the calculation losses corresponding to the at least one sample data set, the cost comparison model is obtained by training with the error back-propagation algorithm. This yields a pre-trained cost comparison model for evaluating the scheduling running cost of operators and ensures that the model can subsequently be invoked to carry out the operator scheduling running time comparison method.
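The training steps above can be sketched on synthetic data. This is a deliberately simplified stand-in, not the application's model: a single-layer logistic comparator replaces the neural network, the features are random 3-vectors, and the labels come from a made-up duration function. For a single layer, the weight update shown is what error back-propagation reduces to under a cross-entropy loss.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Sample = Tuple[Vector, Vector, int]  # (features_a, features_b, label); 1 means a is faster

def predict(weights: Vector, a: Vector, b: Vector) -> float:
    """Predicted probability that schedule a runs faster than schedule b."""
    z = sum(w * (x - y) for w, x, y in zip(weights, a, b))
    z = max(-60.0, min(60.0, z))  # clamp so exp() stays in range
    return 1.0 / (1.0 + math.exp(-z))

def train(samples: List[Sample], epochs: int = 100, lr: float = 0.5) -> Vector:
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for a, b, label in samples:
            error = predict(weights, a, b) - label  # d(cross-entropy)/dz
            for i in range(len(weights)):
                weights[i] -= lr * error * (a[i] - b[i])  # gradient step
    return weights

def make_samples(rng: random.Random, count: int) -> List[Sample]:
    """Synthetic pairs; a hidden, made-up duration function supplies the labels."""
    def duration(v: Vector) -> float:
        return 2.0 * v[0] + 1.0 * v[1] + 0.5 * v[2]
    samples = []
    for _ in range(count):
        a = [rng.random() for _ in range(3)]
        b = [rng.random() for _ in range(3)]
        samples.append((a, b, 1 if duration(a) < duration(b) else 0))
    return samples

rng = random.Random(0)
weights = train(make_samples(rng, 200))
held_out = make_samples(rng, 200)
accuracy = sum((predict(weights, a, b) > 0.5) == (label == 1)
               for a, b, label in held_out) / len(held_out)
```

Feeding the model the difference of the two feature vectors makes its output antisymmetric: swapping the two schedules flips the prediction. That is a design choice of this sketch rather than something stated in the application.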

In another possible implementation manner, after calling the cost comparison model to output the cost comparison result according to the at least two candidate schedules, the method further includes:

adding the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set;

The cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model.

In this implementation, the at least two candidate schedules and the cost comparison result are added to the training sample set to obtain an updated training sample set, and the cost comparison model is trained on the updated training sample set to obtain an updated cost comparison model. The cost comparison model is thus updated in time, and its accuracy is continuously improved.

In the second aspect, the embodiment of the present application provides an operator scheduling runtime comparison device, the device includes:

A first acquisition unit, configured to acquire at least two candidate schedules corresponding to a target computation expression, where the target computation expression is used to describe the computation logic of an operator, and each candidate schedule is executable code of the operator generated based on the target computation expression;

The second acquisition unit is used to acquire a cost comparison model, and the cost comparison model is a model obtained by training a neural network by adopting multiple sample scheduling;

The calling unit is configured to call the output of the cost comparison model according to the at least two candidate schedules to obtain a cost comparison result, and the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules.

In a possible implementation manner, the calling unit is also used for:

In another possible implementation manner, the calling unit is also used for:

In another possible implementation manner, the device further includes a training unit; the training unit is used for:

In another possible implementation manner, the device further includes an update unit; the update unit is configured to:

In the third aspect, the embodiment of the present application provides an operator scheduling runtime comparison device, the device includes:

processor;

memory for storing processor-executable instructions;

Wherein, the processor is configured to implement the above method when executing the instructions.

In a fourth aspect, the embodiment of the present application provides a non-volatile computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor.

In a fifth aspect, an embodiment of the present application provides a computer program product, and when the computer program product is run on a computer, the computer executes the above-mentioned method.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the specification, serve to explain the principles of the application.

Fig. 1 shows a schematic diagram of a generation process of a scheduling space in the related art.

Fig. 2 shows a schematic diagram of the principles of the actual measurement method and the cost model method in the related art.

Fig. 3 shows a schematic structural diagram of a computer device provided by an exemplary embodiment of the present application.

Fig. 4 shows a flowchart of a method for comparing scheduled running time of operators provided by an exemplary embodiment of the present application.

Fig. 5 shows a schematic diagram of the principle of a method for comparing scheduled running time of operators provided by an exemplary embodiment of the present application.

Fig. 6 shows a flow chart of the training process of the cost comparison model provided by an exemplary embodiment of the present application.

Fig. 7 shows a schematic diagram of a training process of a cost comparison model provided by an exemplary embodiment of the present application.

Fig. 8 shows a schematic diagram of an input-output curve of a normalization function provided by an exemplary embodiment of the present application.

Fig. 9 shows a schematic diagram of a network structure of a multi-layer perceptron architecture provided by an exemplary embodiment of the present application.

Fig. 10 shows a flowchart of a method for comparing scheduled running time of operators provided by another exemplary embodiment of the present application.

Fig. 11 shows a schematic diagram of a data structure of a feature matrix provided by an exemplary embodiment of the present application.

Fig. 12 shows a schematic diagram of the application process of the cost comparison model provided by another exemplary embodiment of the present application.

Fig. 13 shows a block diagram of an apparatus for comparing scheduled runtimes of operators provided by an exemplary embodiment of the present application.

Detailed description

Various exemplary embodiments, features, and aspects of the present application will be described in detail below with reference to the accompanying drawings. The same reference numbers in the figures indicate functionally identical or similar elements. While various aspects of the embodiments are shown in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as superior or better than other embodiments.

In addition, in order to better illustrate the present application, numerous specific details are given in the following specific implementation manners. It will be understood by those skilled in the art that the present application may be practiced without certain of the specific details. In some instances, methods, means, components and circuits well known to those skilled in the art have not been described in detail in order to highlight the gist of the present application.

With the rapid development of artificial intelligence technology, deep learning has been widely used in many fields, and the demand of these applications for computing resources has grown rapidly, so the optimization of deep learning algorithms is becoming more and more important. Deep learning establishes a deep learning model and iteratively fits it to a large amount of historical data (model training), so that the model learns a mapping between input and output and can then predict results for new input data (model inference). A deep learning model contains a large number of operators, such as convolution operators, fully connected operators and pooling operators. The whole formed by stacking and connecting different operators constitutes the deep learning model, also known as a neural network model. The topology of the neural network is called the neural network architecture, and the parameters of the operators contained in the neural network are the model parameters. To enable an operator to execute efficiently on a specific hardware platform, its computation expression must be deeply optimized. The specific hardware platform may be a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), or a neural network processing unit (Neural network Processing Unit, NPU).

The computation expression of an operator can be implemented in many ways, and each such implementation is called a schedule. The performance of different schedules on a specific hardware platform can differ greatly. Using a compiler/automatic optimizer to automatically search a large number of schedule implementations for the one that is optimal on specific hardware can optimize deep learning applications, thereby reducing computing power requirements and increasing system throughput. In engineering practice, there may be an intermediate representation between the computation expression and the schedule, which is called a template. A computation expression can form multiple templates, and each template can generate multiple schedules.

Automatic optimization of operators is an important function of optimization tools and compilers. The performance of operators after automatic optimization determines whether deep learning models can be applied efficiently and meet product requirements. The difficulty of automatic operator optimization lies in searching, within a scheduling space formed by a massive number of schedules, for the optimal schedule implementation for a specific hardware platform. Evaluating the execution time of different schedules of the operators in a neural network on the hardware platform is therefore critical to the success of the optimization. The "cost" mentioned in this application refers to the execution time of a schedule on the hardware platform. There are currently two main methods for evaluating the execution time of a schedule on a hardware platform: the actual measurement method and the cost model method.

The actual measurement method generates code for each schedule, compiles it, and runs it on the hardware; the execution time is obtained by measuring the run, which requires a complete compilation process. Its disadvantage is that evaluating one schedule takes a long time (on the order of seconds or more), which is far too slow for real scheduling spaces containing hundreds of thousands or millions of schedules. Because search time is limited, it is difficult to explore a larger scheduling space.
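A back-of-the-envelope calculation shows the scale of the problem. The sketch assumes an optimistic 1 second per schedule and strictly sequential measurement; both figures are assumptions for illustration, since the text only says that evaluation takes seconds or more.

```python
def total_hours(num_schedules: int, seconds_per_schedule: float) -> float:
    """Wall-clock hours needed to measure every schedule one after another."""
    return num_schedules * seconds_per_schedule / 3600.0

hours_100k = total_hours(100_000, 1.0)          # about 27.8 hours
days_1m = total_hours(1_000_000, 1.0) / 24.0    # about 11.6 days
```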

The cost model method refers to evaluating the execution time of scheduling by establishing a cost model. Because this method does not need to go through the process of compiling, running and measuring, it has a very obvious advantage in terms of time-consuming evaluation.

In the related art, methods based on a cost model all evaluate the scheduling running cost by predicting the absolute execution time of a schedule. However, with this approach the error between the predicted absolute execution time and the real execution time is relatively large, and professionals must build a dedicated cost model for each specific hardware platform, which often requires a large amount of training data and a complex model structure. In addition, because of the relatively large prediction error, the uncertainty in comparing the costs of schedules with similar predicted values cannot be eliminated. These shortcomings limit the application of the related-art cost model method in actual optimization.

The embodiments of the present application provide a method, device and storage medium for comparing the scheduled running time of operators. Without predicting the absolute execution time of any schedule, the relative magnitudes of the execution times of different schedules are compared directly, thereby realizing the automatic tuning function of a compiler/automatic optimizer and greatly improving the speed and accuracy of evaluating scheduling running cost. Compared with the methods in the related art, the operator scheduling running time comparison method provided by the embodiments of this application has clear advantages in speed and accuracy, improves the performance of the operator optimizer, and significantly reduces evaluation time.

First, some terms involved in this application are introduced.

1. Computation expression (compute): the whole formed by the input data, output data and computation logic of an operator. A computation expression is an instance that describes a specific computation process. In an automatic operator optimization framework, the computation expression can be user-defined and carries all the information needed to implement the computation logic required by the user. It is usually written as pseudocode or a structured flowchart, which is easy to write but not optimized.

2. Template: Computational expressions can be transformed into templates through a series of equivalent transformations. The template is the intermediate representation information between the calculation expression and the scheduling during the optimization process of the calculation expression structure. Generally speaking, the template determines the order of calculation execution in the calculation expression logic and the mode of data access.

The template changes the execution order and data access mode of the computation expression, but does not restrict how the input data of the computation expression is divided. For example, after an axis-split transformation, a single loop can be divided into several sub-loops, and divisions into different numbers of sub-loops are different templates. In each template, the loop bounds of the sub-loops need only be equivalent to the computation expression; the values of the upper and lower bounds of each sub-loop are not determined.

3. Schedule: According to the hardware characteristics of the target hardware platform, the intermediate representation information of the operator is scheduled and optimized. Scheduling determines the specific expression of all variable parameters in the template, which can be transformed into a description of the calculation expression implemented by software. For the same input data, the scheduled output data is exactly the same as the output data of the calculation expression, but the calculation process can be different.

4. Feature embedding: the intermediate output of the input data after passing through the neural network module. Feature embedding is the mapping of the neural network module to the input data in another space, including the extraction, enhancement and encoding of the input data.

5. Multi-layer perceptron: a basic unit of neural network composed of fully connected layers, activation layers, etc. Multilayer perceptrons can form an overall neural network architecture, or they can appear as modules within a part of an overall architecture.

In an illustrative example that takes a single-loop transformation as an example (pseudo-code), the generation process of the scheduling space is shown in Figure 1. The computer device (such as an automatic optimizer) obtains the computation expression input by the user and transforms it to generate a template space; the templates in the template space can be transformed into schedule implementations whose logic is equivalent to the computation expression. The set of valid schedules forms the scheduling space. The computer device searches the scheduling space and outputs the optimal schedule implementation. In this example, the computation expression is a user-defined loop computation, and the computation logic in the loop body is represented by a statement (stmt). The loop bounds run from 0 to 546756. After axis-split transformation, the single loop of the computation expression can be equivalently transformed into templates with double nested loops, triple nested loops, and so on up to N-fold nested loops. In a template, the upper and lower bounds of each nested loop are not yet determined. Axis splitting allows different plans for the data access pattern. Filling a template with loop bound values that are equivalent to the loop bounds of the computation expression, and reasonably transforming or constraining the loop expression (for example, stmt_tpln_immd_constrain in Figure 1 may be the intermediate constraint expression of the n-th template), yields code whose logic is equivalent to the computation expression; this code is a schedule.
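The step from a two-level template to concrete schedules can be sketched for the loop in this example. The sketch assumes that a split is valid only when the outer and inner extents multiply exactly to the original extent; the function names are hypothetical and are not taken from Figure 1.

```python
from typing import List, Tuple

def two_level_splits(extent: int) -> List[Tuple[int, int]]:
    """All (outer, inner) extents with outer * inner == extent.
    Each pair fills in the undetermined bounds of the two-level template."""
    splits = []
    d = 1
    while d * d <= extent:
        if extent % d == 0:
            splits.append((d, extent // d))
            if d != extent // d:
                splits.append((extent // d, d))
        d += 1
    return sorted(splits)

def visited_indices(outer: int, inner: int) -> List[int]:
    """Indices touched by the split loop nest, in execution order."""
    return [o * inner + i for o in range(outer) for i in range(inner)]

splits = two_level_splits(546756)  # 546756 = 2^2 * 3 * 7 * 23 * 283
```

Under this assumption the double-nested template alone yields 48 schedules for an extent of 546756, and every one of them visits exactly the indices of the original single loop. Deeper nesting multiplies the number of choices, which is how the scheduling space grows so large.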

In actual scenarios, a computation expression composed of complex computation logic can usually derive tens of millions of schedule implementations, and the execution times of different schedules on the target hardware platform can differ by factors of hundreds or thousands. The automatic optimization system uses a series of operations to search the space formed by this massive number of schedules for the schedule that executes best on the hardware, thereby optimizing the operator.

As shown in Figure 2, the actual measurement method generates legal code according to the definition of a schedule, compiles it with a compiler, executes and measures it on the hardware, and obtains a performance evaluation result. The result is usually the execution time of the schedule, and can also be the number of hardware clock cycles required to run it. After a limited number of schedule implementations have been measured, the schedule with the shortest execution time (smallest hardware clock cycle count) is selected. The execution time obtained in this way is the most accurate. The disadvantage is that code generation and compilation usually take several seconds to several minutes, and the time taken by running and measuring depends on the computation amount and complexity of the operator, so the optimization process is very slow.

A search and selection method based on a machine learning model can greatly accelerate the above process, shortening code generation, compilation and running, which take seconds, to a neural network inference that takes milliseconds. However, because the accuracy of model prediction is limited, the quality of the selected optimum may decline. In the current cost model method described above, after feature extraction is performed on a schedule, the cost model is invoked to predict the absolute execution time or the number of running cycles of the schedule. The accuracy of such a cost model is low: the cost model in the related art has an average error of 16% in predicting scheduling execution time, while the actual execution times of a considerable number of schedules differ by less than that 16% error. In addition to the error, the cost of obtaining the cost model is high: training requires 1.8 million training samples, the network architecture is complex, and training takes a long time to converge. Therefore, this method cannot achieve fast and accurate operator search optimization well.
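The effect of the 16% figure on comparisons can be illustrated with a small sketch. It treats the average error as a symmetric band around each prediction, which is a simplifying assumption (an average error is not a bound), and the predicted times are made up.

```python
def could_be_misordered(pred_a: float, pred_b: float, rel_error: float) -> bool:
    """True when the error bands around two predicted times overlap, so the
    predictions alone cannot settle which schedule is actually faster."""
    lo_a, hi_a = pred_a * (1 - rel_error), pred_a * (1 + rel_error)
    lo_b, hi_b = pred_b * (1 - rel_error), pred_b * (1 + rel_error)
    return lo_a <= hi_b and lo_b <= hi_a

ambiguous = could_be_misordered(100.0, 110.0, 0.16)  # bands [84, 116] and [92.4, 127.6]
clear = could_be_misordered(100.0, 200.0, 0.16)      # bands [84, 116] and [168, 232]
```

Two schedules whose predicted times differ by 10% fall inside each other's bands, so an absolute-time predictor with this error level cannot order them reliably, whereas a twofold difference is unambiguous.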

The embodiment of the present application provides a new cost model: a cost comparison model. The cost comparison model avoids directly predicting the scheduling execution time, and transforms the regression problem into a classification problem that is easy for neural network to learn. The cost comparison model takes at least two candidate schedules as input, and the output result is a cost comparison result, and the cost comparison result is used to indicate the order of the execution time of the at least two candidate schedules on the target hardware platform. The method provided by the embodiment of the present application has the advantages of high accuracy, fast inference speed, and lower training cost than existing methods. In the operator optimization process, the cost comparison model provided by the embodiment of the present application can quickly compare the execution time of different schedules, thereby realizing large-scale search optimization of operators.

It should be noted that the operator scheduling runtime comparison method provided in the embodiment of the present application can be applied to the optimization process of an automatic operator optimization system. The core content of the embodiment of the present application is the cost comparison model, including the model architecture, model training process and model application process of the cost comparison model. The operator scheduling runtime comparison method provided in the embodiment of the present application can be applied to a specific computer device (such as a CPU, GPU or NPU), and performs large-scale comparison and search on multiple candidate schedule implementations of the target computing expression, so as to obtain the optimal schedule and achieve the purpose of optimizing the target computing expression on the specific computer device.

The execution subject of the method for comparing the scheduled running time of operators provided in the embodiment of the present application is a computer device, which may be a general-purpose computer device or a special-purpose computing device. Please refer to FIG. 3 , which shows a schematic structural diagram of a computer device provided by an exemplary embodiment of the present application.

The computer device may be a terminal or a server. Terminals include tablet computers, laptop computers, and desktop computers, among others. The server can be one server, or a server cluster composed of several servers, or a cloud computing service center.

As shown in FIG. 3, the computer device includes a processor 10, a memory 20 and a communication interface 30. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation to the computer device, which may include more or fewer components than those shown in the illustration, or combine some components, or arrange the components differently. Wherein:

The processor 10 is the control center of the computer equipment, and uses various interfaces and lines to connect various parts of the entire computer equipment, by running or executing software programs and/or modules stored in the memory 20, and calling data stored in the memory 20 , to perform various functions of the computer equipment and process data, thereby controlling the computer equipment as a whole. The processor 10 may be implemented by a CPU, or may be implemented by a GPU.

The memory 20 can be used to store software programs as well as modules. The processor 10 executes various functional applications and data processing by executing software programs and modules stored in the memory 20. The memory 20 can mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system 21, a first acquisition unit 22, a second acquisition unit 23, a calling unit 24 and at least one functionally required application program 25 (such as neural network training); the data storage area can store data created according to the use of the computer device, etc. The memory 20 can be realized by any type of volatile or nonvolatile memory device or their combination, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. Correspondingly, the memory 20 may also include a memory controller to provide the processor 10 with access to the memory 20.

Wherein, the processor 10 performs the following function by running the first acquisition unit 22: acquire at least two candidate schedules corresponding to the target calculation expression, where the target calculation expression is used to describe the calculation logic of the operator, and a candidate schedule is the executable code of the operator generated based on the target calculation expression. The processor 10 performs the following function by running the second acquisition unit 23: acquire a cost comparison model, where the cost comparison model is a model obtained by training a neural network using multiple sample schedules. The processor 10 performs the following function by running the calling unit 24: according to the at least two candidate schedules, call the cost comparison model to output a cost comparison result, where the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules.

Optionally, the computer device obtains the calculation expression code input by the user, that is, the target calculation expression, analyzes the target calculation expression through the operator optimization system, generates a template space based on optimization rules or a polyhedral model, and generates a large number of legal candidate schedules; the generated candidate schedules form a scheduling space. An instance in the scheduling space represents a legal schedule. The cost comparison model provided by the embodiment of this application serves as an evaluation module that compares at least two input candidate schedules and outputs the cost comparison result, so as to achieve the goal of searching the scheduling space for the optimal schedule.

In the following, a method for comparing scheduled running time of operators is introduced by using a schematic embodiment.

Please refer to FIG. 4, which shows a flowchart of a method for comparing scheduled running time of operators provided by an exemplary embodiment of the present application. This embodiment is described by taking the method being applied to the computer device shown in FIG. 3 as an example. The method for comparing the scheduled running time of operators includes:

Step 401: Acquire at least two candidate schedules corresponding to the target computing expression. The target computing expression is used to describe the computing logic of the operator. The candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform.

Optionally, the computer device acquires at least two candidate schedules from the schedule space corresponding to the target computation expression. Schematically, the computer device obtains the input target calculation expression, analyzes the target calculation expression, generates a template space according to a preset method, generates multiple candidate schedules by instantiating the template, and the generated multiple candidate schedules constitute a scheduling space. The computer device obtains at least two candidate schedules from the schedule space.

Optionally, the preset method is a dynamic programming method, an optimization rule method or a polyhedral model method. For different computing systems, the preset methods may also be different. The embodiment of the present application does not limit the generation algorithm of the scheduling space. The embodiment of the present application is applicable as long as the scheduling space includes at least two candidate schedules for comparison.

Wherein, the target calculation expression is a specific calculation expression, for example, the target calculation expression is an input calculation expression.

The candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform. For example, the target hardware platform is a CPU, GPU or NPU. This embodiment of the present application does not limit it.

Optionally, when the computer device receives the preset acquisition instruction, it acquires at least two candidate schedules corresponding to the target computing expression. Alternatively, the computer device acquires at least two candidate schedules corresponding to the target computing expression every preset time interval. Alternatively, the computer device acquires at least two candidate schedules corresponding to the target computing expression in real time.

Wherein, the preset time interval is a default setting or a custom setting, which is not limited in this embodiment.

Step 402: Acquire a cost comparison model, which is a model obtained by training a neural network by using multiple sample schedules.

The computer device obtains the trained cost comparison model. In a possible implementation manner, when the computer device is a terminal, the terminal obtains a trained cost comparison model stored by itself, or obtains a trained cost comparison model from a server. In another possible implementation manner, when the computer device is a server, the server obtains a trained cost comparison model stored in itself.

The cost comparison model is a model obtained by training the neural network by using at least two sample schedules and correct cost comparison results. That is, the cost comparison model is determined according to at least two sample schedules and correct cost comparison results. Wherein, the correct cost comparison result is a pre-marked correct cost comparison result corresponding to the at least two sample schedules.

Among them, the neural network of the cost comparison model can adopt an end-to-end stacked multi-layer perceptron architecture. Other reasonable deformation architectures can also achieve the fitting function of the cost comparison model, and different architectures will affect the final accuracy of the model. Any network architecture formed by deformation, derivation, and layer replacement of this architecture should be regarded as equivalent to the neural network described in the embodiments of this application.

For example, the neural network is a deep neural network (Deep Neural Network, DNN). For example, the neural network is a Convolutional Neural Network (CNN). For another example, the neural network is a Recurrent Neural Network (RNN). This embodiment of the present application does not limit it.

The cost comparison model is a neural network model that identifies relative execution times of at least two candidate schedules on the target hardware platform.

The cost comparison model is used to convert the input at least two candidate schedules into cost comparison results. The cost comparison result is used to indicate the ranking of the execution durations of the at least two candidate schedules on the target hardware platform.

The cost comparison model is used to represent the correlation between at least two candidate schedules and the cost comparison results.

The cost comparison model is a preset mathematical model, and the cost comparison model includes model coefficients between at least two candidate schedules and cost comparison results. The model coefficient can be a fixed value, or a value that is dynamically modified over time, or a value that is dynamically modified according to a usage scenario.

Step 403: According to the at least two candidate schedules, invoke the cost comparison model to output a cost comparison result, where the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules on the target hardware platform.

Optionally, the computer device performs preprocessing on the at least two candidate schedules to obtain at least two preprocessed candidate schedules, inputs the at least two preprocessed candidate schedules into the cost comparison model, and outputs the cost comparison result.

Wherein, the cost comparison result is used to indicate the order of the execution durations of at least two candidate schedules on the target hardware platform. That is, the cost comparison result does not indicate the absolute execution time of the at least two candidate schedules on the target hardware platform, but indicates the relative size of the execution time of the at least two candidate schedules on the target hardware platform.

Optionally, the cost comparison result is coding information of a comparison result of the predicted execution durations of at least two candidate schedules. The computer device decodes the coded information output by the cost comparison model, and obtains the order of the execution durations of at least two candidate schedules, that is, the comparison result.

Schematically, the cost comparison result includes encoded information, and the values of the encoded information are in one-to-one correspondence with the execution duration comparison results of the at least two candidate schedules. Schematically, taking the at least two candidate schedules being a first candidate schedule and a second candidate schedule as an example: when the encoded information is a first value, it indicates that the execution duration of the first candidate schedule is shorter than the execution duration of the second candidate schedule; when the encoded information is a second value, it indicates that the execution duration of the first candidate schedule is equal to the execution duration of the second candidate schedule; and when the encoded information is a third value, it indicates that the execution duration of the first candidate schedule is longer than the execution duration of the second candidate schedule, where the first, second, and third values are different from one another.

Optionally, the computer device selects the candidate schedule with the shortest execution time among the at least two candidate schedules as the target schedule according to the cost comparison results of the at least two candidate schedules, retains the target schedule, and discards other candidate schedules except the target schedule.

Optionally, when the cost comparison result indicates that the execution durations of the at least two candidate schedules are the same, the computer device takes any one of the at least two candidate schedules as the target schedule, retains the target schedule, and discards the other candidate schedules. The embodiment of the present application does not limit the method of retaining and discarding schedules.

In a schematic example, as shown in FIG. 5, the computer device obtains the input target computing expression, analyzes the target computing expression, generates a template space according to a preset method, and generates multiple candidate schedules by instantiating the templates; the generated candidate schedules constitute a scheduling space. Two candidate schedules, such as schedule A and schedule B, are obtained from the scheduling space. Schedule A and schedule B are preprocessed to obtain the preprocessed schedule A and schedule B; the preprocessed schedule A and schedule B are input into the cost comparison model, which outputs encoded information, and the encoded information is decoded to obtain the cost comparison result of schedule A and schedule B. For example, when the encoded information is 001, it indicates that the execution duration of schedule A is less than that of schedule B, so schedule A is retained and schedule B is discarded; when the encoded information is 010, it indicates that the execution duration of schedule A is equal to that of schedule B, so either schedule A or schedule B is retained; when the encoded information is 100, it indicates that the execution duration of schedule A is greater than that of schedule B, so schedule B is retained and schedule A is discarded.
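The decode-and-select step in this example can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the names `Comparison`, `decode` and `select_target` are hypothetical, the three-digit one-hot codes 001, 010 and 100 are assumed, and the model output is assumed to be three scores whose largest entry marks the set digit.

```python
from enum import Enum


class Comparison(Enum):
    """Assumed three-digit one-hot codes for the comparison of (A, B)."""
    A_FASTER = "001"  # execution duration of A is less than that of B
    EQUAL = "010"     # execution durations are equal
    B_FASTER = "100"  # execution duration of A is greater than that of B


def decode(scores):
    """Turn three output scores into a one-hot code by taking the largest score.

    Assumption: the score order matches the digit order of the code,
    so a largest score at index 2 yields "001".
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three scores")
    index = max(range(3), key=lambda i: scores[i])
    code = "".join("1" if i == index else "0" for i in range(3))
    return Comparison(code)


def select_target(schedule_a, schedule_b, result):
    """Keep the schedule predicted to run faster; on a tie either may be kept."""
    if result is Comparison.B_FASTER:
        return schedule_b
    return schedule_a


result = decode([0.1, 0.2, 0.7])
target = select_target("schedule A", "schedule B", result)
```

On a tie the sketch keeps schedule A, which is one of the choices the text allows.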

To sum up, the embodiment of the present application obtains at least two candidate schedules corresponding to the target computing expression, and invokes the cost comparison model according to the at least two candidate schedules to directly compare the relative execution durations of the at least two candidate schedules on the target hardware platform, so as to output a cost comparison result that indicates the order of the execution durations. This realizes the automatic tuning function of the compiler/automatic optimizer, and greatly improves the speed and accuracy of evaluating the running cost of schedules.

It should be noted that before the computer device acquires the cost comparison model, the cost comparison model needs to be obtained by training on a training sample set. The training process of the cost comparison model is introduced below.

In a possible implementation, as shown in Figure 6, the training process of the cost comparison model includes the following steps:

In step 601, a training sample set is obtained, and the training sample set includes at least one set of sample data groups.

The cost comparison model is trained according to at least one set of sample data sets, and each set of sample data sets includes: at least two sample schedules corresponding to sample calculation expressions and pre-marked correct cost comparison results.

Step 602, for each set of sample data groups, perform preprocessing on at least two sample schedules to obtain at least two preprocessed sample schedules.

For each sample data group, the computer device performs feature extraction on each sample schedule in at least two sample schedules to obtain a feature matrix, and normalizes the feature matrix corresponding to the sample schedule to obtain a preprocessed sample schedule.

Schematically, feature extraction is the process of extracting features from sample schedules and converting features into structured data.

It should be noted that, for the relevant description of the feature matrix, reference may be made to the relevant details in the following embodiments, which will not be introduced here.

Step 603, input the preprocessed at least two sample schedules into the original parameter model to obtain a training result, and the original parameter model is a neural network model.

Optionally, the original parameter model is established according to the neural network model, for example: the original parameter model is established according to the DNN model.

Schematically, for each sample data group, the computer device creates an input-output pair corresponding to the sample data group, where the input parameter of the input-output pair is the at least two sample schedules in the sample data group, and the target parameter is the correct cost comparison result in the sample data group; the computer device inputs the input parameter into the original parameter model to obtain the training result.

Optionally, input-output pairs are represented by feature vectors.

Step 604, comparing the training result with the correct cost comparison result to obtain a calculation loss, which is used to indicate the error between the training result and the correct cost comparison result.

Optionally, the training result is the coding information output by the original parameter model, and the correct cost comparison result is the pre-marked coding information. For example, the encoded information is information encoded by one-hot code (One-Hot).

Optionally, the calculation loss is represented by cross entropy.

Step 605: According to the calculated losses corresponding to at least one set of sample data sets, use the error back propagation algorithm to train to obtain a cost comparison model.

Optionally, the computer device determines the gradient direction of the cost comparison model according to the calculation loss through the back propagation algorithm, and updates the model parameters in the cost comparison model layer by layer from the output layer of the cost comparison model.
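As an illustration of how the calculation loss in step 604 could be computed when it is represented by cross entropy, the following minimal sketch compares a model output with a one-hot target. The function names and the use of a softmax over three raw scores are assumptions made for illustration; the embodiment does not specify them.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """Cross entropy between predicted probabilities and a one-hot target."""
    eps = 1e-12  # avoids log(0) for probabilities that underflow to zero
    return -sum(t * math.log(p + eps) for p, t in zip(probs, one_hot))


# Target "001": schedule A is faster than schedule B (assumed code layout).
target = [0, 0, 1]
uncertain = cross_entropy(softmax([0.0, 0.0, 0.0]), target)
confident = cross_entropy(softmax([0.0, 0.0, 5.0]), target)
```

A prediction that puts more probability on the correct digit gives a smaller loss, which is the quantity the back propagation in step 605 would reduce.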

In an illustrative example, the at least two candidate schedules being schedule A and schedule B is taken as an example, as shown in FIG. 7. The computer device extracts two schedules from the scheduling space, namely schedule A and schedule B, as input data for cost comparison model training. The relative execution durations of the two schedules on the target hardware platform are compared, and one-hot encoding is used to generate the encoded information for the input (A, B) (that is, the correct cost comparison result) as the target parameter of the backpropagation algorithm. The encoded information is shown in Table I. When the encoded information is the first value, it indicates that the execution duration of schedule A is less than the execution duration of schedule B; when the encoded information is the second value, it indicates that the execution duration of schedule A is equal to the execution duration of schedule B; and when the encoded information is the third value, it indicates that the execution duration of schedule A is greater than the execution duration of schedule B.

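Table I

Encoded information | Meaning
001 | The execution duration of schedule A is less than the execution duration of schedule B
010 | The execution duration of schedule A is equal to the execution duration of schedule B
100 | The execution duration of schedule A is greater than the execution duration of schedule B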

The computer equipment performs feature extraction on schedule A and schedule B to obtain respective corresponding feature matrices. For example, the feature matrices of scheduling A and scheduling B are two 250x57-dimensional matrices. Part of the column data in the feature matrix is normalized to limit its dynamic range. The formula of the normalization function is as follows:

Among them, v is the input data, and v* is the output data. Schematically, the input-output curve of the normalization function is shown in FIG. 8 . Wherein, the abscissa is the above-mentioned input data, and the ordinate is the above-mentioned output data.

The computer device inputs the normalized schedule A into feature embedding module A composed of a multi-layer perceptron, that is, DNN_A, which outputs a 1x512-dimensional schedule embedding A, and inputs the normalized schedule B into feature embedding module B composed of a multi-layer perceptron, that is, DNN_B, which outputs a 1x512-dimensional schedule embedding B. The two schedule embeddings are subtracted element by element, that is, the schedule embedding A is subtracted from the schedule embedding B to obtain a schedule difference embedding. The schedule difference embedding is input into the deep network discriminant module, DNN_CLS, which outputs the training result, namely encoded information consisting of three numbers. According to the output data of the deep network discriminant module and the real label of schedule A and schedule B, that is, the correct cost comparison result, the mean square error loss function (or minimum square error function) is used as the loss function to compute the calculation loss of the model for the current input. The calculation loss is backpropagated by the gradient descent method, and the model parameters of the neural network modules DNN_A, DNN_B, and DNN_CLS are updated. The above steps are repeated to train on the training sample set for multiple epochs (for example, 30 epochs) until the model converges. Wherein, the network structure of DNN_A, DNN_B, and DNN_CLS may be an end-to-end stacked multi-layer perceptron architecture, and the network structure is shown in FIG. 9. In FIG. 9, each number represents the number of neurons in the corresponding layer, and the ReLU function is used as the activation function between the fully connected layers.
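The data flow through the three modules can be sketched with a toy forward pass. The sketch below is written under several assumptions that are not stated in the text: the feature matrix is flattened into a single vector, the layer sizes are tiny stand-ins for the 250x57 input and 1x512 embedding, the weights are random rather than trained, and the difference is taken as embedding B minus embedding A (a literal reading of the sentence above). All function names are hypothetical.

```python
import random


def make_layer(rng, n_in, n_out):
    """Create a randomly initialised fully connected layer (weights, biases)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x, relu):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def build_mlp(rng, sizes):
    """Stack fully connected layers with the given layer sizes."""
    return [make_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]


def mlp(layers, x):
    """Run a stacked perceptron: ReLU between layers, none after the last."""
    for i, layer in enumerate(layers):
        x = dense(layer, x, relu=(i < len(layers) - 1))
    return x


def compare(model, feat_a, feat_b):
    """Embed both schedules, subtract element by element, then classify."""
    emb_a = mlp(model["dnn_a"], feat_a)
    emb_b = mlp(model["dnn_b"], feat_b)
    diff = [b - a for a, b in zip(emb_a, emb_b)]  # direction is an assumption
    return mlp(model["dnn_cls"], diff)  # three raw scores


rng = random.Random(0)
FEATURES, EMBED = 12, 8  # toy sizes standing in for 250x57 and 1x512
model = {
    "dnn_a": build_mlp(rng, [FEATURES, 16, EMBED]),
    "dnn_b": build_mlp(rng, [FEATURES, 16, EMBED]),
    "dnn_cls": build_mlp(rng, [EMBED, 16, 3]),
}
feat_a = [rng.random() for _ in range(FEATURES)]
feat_b = [rng.random() for _ in range(FEATURES)]
scores = compare(model, feat_a, feat_b)
```

In practice a deep learning framework would replace these hand-written layers and would also provide the loss and gradient computation used during training.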

Based on the cost comparison model obtained by the above training, please refer to FIG. 10, which shows a flowchart of a method for comparing scheduled running time of operators provided by another exemplary embodiment of the present application. This embodiment is described by taking the method being applied to the computer device shown in FIG. 3 as an example. The method for comparing the scheduled running time of operators includes:

Step 1001, obtain at least two candidate schedules from the schedule space corresponding to the target computation expression.

Among them, the target computing expression is used to describe the computing logic of the operator, and the candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform.

Optionally, the computer device acquires an input target computing expression, analyzes the target computing expression, generates a template according to a preset method, and determines a scheduling space, where the scheduling space includes at least two candidate schedules generated by instantiating the template. The computer device acquires at least two candidate schedules from the scheduling space corresponding to the target computing expression.

Schematically, the scheduling space includes n candidate schedules. In a possible implementation, the candidates are compared pairwise one after another, and the better one is kept after each comparison, so that the optimal target schedule is obtained after n-1 comparisons. In another possible implementation, a bisection method is used for comparison. For example, when n is 8, the 8 schedules are divided into 4 groups of two, and the cost comparison model selects the faster candidate schedule from each of the 4 groups, giving 4 candidate schedules for a second grouping. The second grouping consists of 2 groups and requires 2 comparisons. After these comparisons are completed, the 2 remaining candidate schedules are compared in a final comparison, so as to obtain the optimal target schedule among the 8 candidate schedules. The embodiment of the present application does not limit the schedule grouping and comparison method.
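The grouped comparison can be illustrated with a short knockout routine. This is a sketch under assumptions: `tournament` is a hypothetical name, `is_faster` stands in for a call to the cost comparison model, and the handling of an odd number of candidates (the unpaired one advances without a comparison) is a choice made here, not something the text specifies.

```python
def tournament(schedules, is_faster):
    """Return (best schedule, number of comparisons) using grouped pairing.

    is_faster(a, b) should return True when a is predicted to run faster
    than b; here it is a placeholder for the cost comparison model.
    """
    current = list(schedules)
    if not current:
        raise ValueError("at least one candidate schedule is required")
    comparisons = 0
    while len(current) > 1:
        winners = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            comparisons += 1
            winners.append(a if is_faster(a, b) else b)
        if len(current) % 2 == 1:
            winners.append(current[-1])  # unpaired candidate advances
        current = winners
    return current[0], comparisons


# Stand-in comparator based on made-up durations, for illustration only.
durations = {"s0": 9.0, "s1": 4.0, "s2": 7.5, "s3": 3.2,
             "s4": 8.1, "s5": 6.0, "s6": 5.5, "s7": 4.4}
best, count = tournament(durations, lambda a, b: durations[a] < durations[b])
```

For 8 candidates the rounds use 4, 2 and 1 comparisons, 7 in total, the same n-1 count as the sequential approach; the rounds differ in that the comparisons within one round are independent of each other.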

Step 1002, for each of the at least two candidate schedules, perform feature extraction on the candidate schedules to obtain a feature matrix.

Optionally, for each of the at least two candidate schedules, the computer device extracts multiple types of information from each of m loops of the candidate schedule and combines them into a vector; the vectors together form the feature matrix corresponding to the candidate schedule, where m is a positive integer. For example, the size of the combined vector is 1x57. Information of at most 250 loops is supported, and the vectors are finally assembled into a 250x57 two-dimensional feature matrix. The number of supported loops can vary according to actual needs, which is not limited in this embodiment of the present application.

Optionally, the feature matrix is used to indicate at least one of loop information, input data shape information, calculation encoding, axis type encoding, and data access type encoding.

The loop information includes information related to the loop calculation logic of the candidate schedule. Optionally, the loop information is the loop information of one level in the schedule; for example, the size of the loop information is 1x6. Wherein, the loop information includes at least one of: loop depth, nesting level, block number, a flag indicating whether it is the last loop, the quotient of the loop depth, and the remainder of the loop depth. Among them, the loop depth and the quotient of the loop depth need to be normalized.

The input data shape information is used to describe the input data of the operator. For example, the size of the input data shape information is 1x10. The operator is a single-input operator, a double-input operator, or a multi-input operator. The shape information of the input data includes: shape information corresponding to k input data, k is a positive integer, and the shape information includes at least one of batch size, number of channels, height, width, and minimum number of channels.

The calculation encoding includes the encoding of the calculation instruction used in the current loop of the candidate schedule. For example, the size of the calculation encoding is 1x6. The calculation encoding includes at least one of: memory access type, program instruction, data type, storage unit, and an identifier indicating whether double buffering is used.

Axis type encodings include encodings for the types of operations on the axes. For example, the size of the axis type code is 1x15. Axis type codes are used to indicate at least one operation among extended, normalized axes.

The data access type encoding includes the type encoding of the access to the data. For example, the size of the data access type encoding is 1x20. The data access type encoding is used to indicate at least one access among write data, read data, allocation, and pragma.

In a schematic example, feature extraction is performed on a candidate schedule to obtain a feature matrix, and the data structure of the feature matrix is shown in FIG. 11. Multiple types of information are extracted from each loop of the candidate schedule and combined into a vector. The size of the combined vector is 1x57, and information of up to 250 loops is supported. Finally, the vectors are assembled into a two-dimensional feature matrix with a size of 250x57, where the feature matrix is used to indicate loop information, input data shape information, calculation encoding, axis type encoding and data access type encoding. The size of the loop information is 1x6, the size of the input data shape information is 1x10, the size of the calculation encoding is 1x6, the size of the axis type encoding is 1x15, and the size of the data access type encoding is 1x20.
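The layout of the feature matrix can be checked with a small assembly sketch. The segment widths come from the example above (6 + 10 + 6 + 15 + 20 = 57); the segment names, the dictionary-based input format and the zero padding of unused rows are assumptions introduced only for illustration.

```python
# Segment widths as given in the example; the names are hypothetical labels.
SEGMENTS = [
    ("loop_info", 6),
    ("input_shape", 10),
    ("calc_encoding", 6),
    ("axis_type", 15),
    ("data_access", 20),
]
ROW_WIDTH = sum(width for _, width in SEGMENTS)  # 57
MAX_LOOPS = 250


def build_row(loop_features):
    """Concatenate the five per-loop segments into one 1x57 vector."""
    row = []
    for name, width in SEGMENTS:
        values = list(loop_features[name])
        if len(values) != width:
            raise ValueError(f"{name} must have {width} values, got {len(values)}")
        row.extend(float(v) for v in values)
    return row


def build_feature_matrix(loops):
    """Stack per-loop vectors into a 250x57 matrix.

    Assumption: rows for unused loops are filled with zeros.
    """
    if len(loops) > MAX_LOOPS:
        raise ValueError(f"at most {MAX_LOOPS} loops are supported")
    rows = [build_row(loop) for loop in loops]
    rows.extend([0.0] * ROW_WIDTH for _ in range(MAX_LOOPS - len(rows)))
    return rows


# Three dummy loops filled with placeholder values.
dummy_loop = {name: [1] * width for name, width in SEGMENTS}
matrix = build_feature_matrix([dummy_loop] * 3)
```

The fixed 250x57 shape means every schedule yields an input of the same size regardless of how many loops it contains.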

It should be noted that, in addition to the feature extraction, mapping methods, and data structures provided by the embodiments of the present application, other scheduling expression methods can also be used as the input of the cost comparison model. The embodiment of the present application does not limit the input data structure.

Step 1003 , for each of the at least two candidate schedules, normalize the feature matrix corresponding to the candidate schedules to obtain a preprocessed candidate schedule.

Step 1004, input the at least two preprocessed candidate schedules into the trained cost comparison model, and output the cost comparison result, where the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules on the target hardware platform.

Optionally, the computer device acquires a trained cost comparison model, where the cost comparison model is a model obtained by training a neural network by using multiple sample schedules. The computer device inputs the at least two preprocessed candidate schedules into the trained cost comparison model and outputs a cost comparison result, where the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules on the target hardware platform.

For the process of invoking the cost comparison model by the computer device, reference may be made to relevant details in the foregoing embodiments, which will not be repeated here.

Optionally, the computer device adds the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set, and trains the cost comparison model according to the updated training sample set to obtain an updated cost comparison model.

In an illustrative example, the at least two candidate schedules being schedule A and schedule B is taken as an example, as shown in FIG. 12. The computer device extracts two schedules, schedule A and schedule B, from the scheduling space, and performs feature extraction on schedule A and schedule B to obtain their corresponding feature matrices. For example, the feature matrices of schedule A and schedule B are two 250x57-dimensional matrices. Part of the column data in each feature matrix is normalized to limit its dynamic range. For the way of normalization, reference may be made by analogy to the description of normalization in the above model training process, which will not be repeated here. The computer device inputs the normalized schedule A into feature embedding module A composed of a multi-layer perceptron, that is, DNN_A, which outputs a 1x512-dimensional schedule embedding A, and inputs the normalized schedule B into feature embedding module B composed of a multi-layer perceptron, that is, DNN_B, which outputs a 1x512-dimensional schedule embedding B. The two schedule embeddings are subtracted element by element, that is, the schedule embedding A is subtracted from the schedule embedding B to obtain the schedule difference embedding. The schedule difference embedding is input into the deep network discriminant module, DNN_CLS, which outputs the cost comparison result, namely an encoding result consisting of three numbers. For the network structure of DNN_A, DNN_B, and DNN_CLS, reference may be made by analogy to the relevant description in the above model training process, which will not be repeated here. The computer device converts the output three-number encoded information into a one-hot coded label format.

To sum up, in this embodiment of the present application, feature extraction is performed on the at least two candidate schedules, each schedule is mapped to its uniquely corresponding matrix representation, and the feature matrices of the at least two candidate schedules are obtained; the feature matrices are normalized; the cost comparison model based on a deep neural network takes the at least two preprocessed feature matrices as input and outputs encoded information representing the predicted comparison result of the execution durations of the at least two candidate schedules; and the encoded information output by the cost comparison model is decoded to obtain the comparison result of the execution durations of the at least two candidate schedules. In other words, the execution durations of different schedule implementations of the same calculation expression on a specific hardware platform are compared by a deep learning network model. This replaces the process of compiling each schedule implementation and then running and measuring it on the hardware, and solves the problem of slow large-scale search in automatic operator optimization systems such as automatic tuners/compilers.

In an illustrative example, the cost comparison model is implemented with the goal of predicting how fast operators execute. The training sample set includes 20792 schedules from 32 operators, and each operator has multiple different schedules. Schedules belonging to the same operator are paired with each other to form the training instance set; the execution durations of the two schedules in each pair are compared, and the target of the paired training instance is generated according to the related method described above. For example, schedule A and schedule B are extracted; the actual execution duration of schedule A is 15 seconds and that of schedule B is 17 seconds. Then (A, B) is a training instance and, because 15 seconds is less than 17 seconds, the target encoding of the training instance (A, B) is 001. The pairwise combinations of schedules belonging to the same operator may include the combination of a schedule with itself, in which case the target encoding of the resulting training instance is 010. Combinations of schedules belonging to the same operator are order-sensitive: the combination (A, B) is different from the combination (B, A), and if the execution durations of A and B differ, the target encodings of (A, B) and (B, A) also differ. If an operator has N (N>2) schedules, pairwise combination yields N^2 training instances. With this combination approach, a relatively large training data set can be built even when the amount of training data is relatively limited. In this example there are 20792 schedules, and a total of 49 million training instances and their target encodings are used to train the model. The model structure is as described above and is not repeated here.
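
The pairing scheme can be sketched in Python as follows. The encodings 001 (first schedule faster) and 010 (a schedule paired with itself) come from the example above. Using 100 for the remaining case, and 010 for any two schedules with equal durations, are assumptions made for illustration; the function names and the input layout are hypothetical.

```python
from itertools import product


def target_code(time_a, time_b):
    """Encode the comparison of two execution durations as a three-digit code."""
    if time_a < time_b:
        return "001"   # from the text: A runs faster than B
    if time_a == time_b:
        return "010"   # from the text for self-pairs; assumed for all ties
    return "100"       # assumption: the remaining case, A runs slower than B


def build_instances(schedules_by_operator):
    """Pair every schedule with every schedule of the same operator (N^2 pairs)."""
    instances = []
    for schedules in schedules_by_operator.values():
        for (name_a, t_a), (name_b, t_b) in product(schedules, repeat=2):
            instances.append(((name_a, name_b), target_code(t_a, t_b)))
    return instances


# Durations in seconds, taken from the example in the text.
instances = build_instances({"op0": [("A", 15.0), ("B", 17.0)]})
for pair, code in instances:
    print(pair, code)
```

Pairs are ordered, so (A, B) and (B, A) are both generated and receive different codes when the durations differ.
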
The neural network model is trained in batches: 5000 training instances are input in each iteration, the learning rate is set to 10e-8, and stochastic gradient descent with momentum is used to train on the complete training instance set for multiple epochs (for example, 30 epochs). The test set includes 46022 test instances, each composed of two schedules belonging to the same operator. No schedule used to generate test instances is included in the schedule set used to generate training instances. The target encoding of each test instance is generated by the related method described above. After the prediction output by the network is passed through the argmax function, the test instance is recorded as correctly predicted by the network if the result exactly matches the test target encoding. Accuracy is defined as the number of test instances correctly predicted by the network divided by the total number of test instances. On the 46022 test instances, the method correctly predicts 41242, an accuracy of 89.61%. The accuracy of the model can be further improved by increasing the number of training schedules and optimizing the network structure.
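
The accuracy definition can be written as a short Python sketch with NumPy. The function name `accuracy` and the toy scores are hypothetical; only the argmax-and-match rule and the counts 41242 and 46022 come from the text.

```python
import numpy as np


def accuracy(scores, targets):
    """Fraction of instances whose argmax prediction matches the one-hot target."""
    predicted = np.argmax(scores, axis=1)
    expected = np.argmax(targets, axis=1)
    return float(np.mean(predicted == expected))


# Toy network outputs for four test instances and their one-hot target codes.
scores = np.array([[0.1, 0.2, 0.7],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.8, 0.1]])
targets = np.array([[0, 0, 1],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0]])
toy_accuracy = accuracy(scores, targets)

# The figure reported in the text: 41242 correct out of 46022 test instances.
reported = round(100 * 41242 / 46022, 2)
print(toy_accuracy, reported)
```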

To sum up, the embodiments of the present application provide a method for comparing the scheduled running times of operators. The method adopts the idea of cost comparison to determine the relative order of the execution durations of at least two schedules, and applies the cost comparison model to the tuning process of an automatic operator optimization system. The embodiments also relate to a modeling method for a cost comparison model applicable to an automatic operator optimization system, covering model architecture design, model training, and model inference. During model training and model inference, a schedule is converted into a special data structure through feature extraction, the data is normalized, and the output is expressed in a specific format. The approach has the advantages of high accuracy, fast inference, and a training cost lower than that of existing methods. That is, first, the relatively high accuracy of the cost comparison model is guaranteed; second, the inference speed of the cost comparison model is improved, and comparing one set of instances takes only 3 milliseconds; third, training the cost comparison model requires relatively little data and computing power: 30 epochs of training on more than 49 million training instances are completed in 70 hours on a single GPU card. With the cost comparison model, automatic tuning in a code optimizer/compiler only needs to consider how to improve the accuracy of the cost comparison model.
By contrast, a cost model in the related art that predicts the absolute execution duration of a schedule must consider not only prediction accuracy but also the boundary problems caused by prediction error. For example, if the difference between the predicted running times of two schedules is smaller than the prediction error of the model, the absolute-value model cannot give a high-confidence prediction.
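
The boundary problem of an absolute-value cost model can be illustrated with a small Python sketch. The function `compare_absolute` and the error values are hypothetical and serve only to show when two predicted durations cannot be ordered with confidence.

```python
def compare_absolute(pred_a, pred_b, model_error):
    """Order two predicted durations, or return None when the gap between
    them is smaller than the prediction error of the model."""
    if abs(pred_a - pred_b) < model_error:
        return None  # no high-confidence ordering is possible
    return "<" if pred_a < pred_b else ">"


# Predicted durations of 15 s and 17 s under two assumed error levels.
print(compare_absolute(15.0, 17.0, model_error=3.0))  # gap 2 s is below the error
print(compare_absolute(15.0, 17.0, model_error=0.5))  # gap 2 s exceeds the error
```

A cost comparison model avoids this intermediate step because it outputs the ordering directly.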

The following are device embodiments of the present application, which can be used to implement the method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.

Please refer to FIG. 13, which shows a block diagram of an apparatus for comparing scheduled running times of operators provided by an exemplary embodiment of the present application. The apparatus can be implemented as all or part of the computer device provided in FIG. 3 through software, hardware, or a combination of the two. The apparatus may include a first acquisition unit 1310, a second acquisition unit 1320, and a calling unit 1330.

The first acquisition unit 1310 is configured to acquire at least two candidate schedules corresponding to a target calculation expression, where the target calculation expression is used to describe the calculation logic of an operator, and each candidate schedule is executable code of the operator generated based on the target calculation expression;

The second acquisition unit 1320 is configured to acquire a cost comparison model, where the cost comparison model is a model obtained by training a neural network using multiple sample schedules;

The calling unit 1330 is configured to invoke the cost comparison model according to the at least two candidate schedules to output a cost comparison result, where the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules.

In a possible implementation, the calling unit 1330 is further configured to:

preprocess the at least two candidate schedules to obtain at least two preprocessed candidate schedules;

input the at least two preprocessed candidate schedules into the cost comparison model, which outputs the cost comparison result.

In another possible implementation, the calling unit 1330 is further configured to:

for each of the at least two candidate schedules, perform feature extraction on the candidate schedule to obtain a feature matrix;

normalize the feature matrix corresponding to the candidate schedule to obtain the preprocessed candidate schedule.

In another possible implementation, the feature matrix is used to indicate at least one of cycle information, input data shape information, a calculation encoding, an axis type encoding, and a data access type encoding. The cycle information includes information related to the cycle calculation logic of the candidate schedule; the input data shape information is used to describe the input data of the operator; the calculation encoding includes the encoding of the calculation instruction used in the current cycle of the candidate schedule; the axis type encoding includes the type encoding of the operation performed on an axis; and the data access type encoding includes the type encoding of the data access.

In another possible implementation, the device further includes a training unit, and the training unit is configured to:

obtain a training sample set, where the training sample set includes at least one sample data group;

for each sample data group, preprocess the at least two sample schedules to obtain at least two preprocessed sample schedules;

input the at least two preprocessed sample schedules into an original parameter model to obtain a training result, where the original parameter model is a neural network model;

compare the training result with the correct cost comparison result to obtain a calculation loss, where the calculation loss is used to indicate the error between the training result and the correct cost comparison result;

train, according to the calculation loss corresponding to each of the at least one sample data group, with an error backpropagation algorithm to obtain the cost comparison model.
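
The loss computation and parameter update performed by the training unit can be sketched in Python with NumPy. This is a toy stand-in, not the model of the application: a single linear layer replaces the neural network, cross-entropy is assumed as the calculation loss, the data is synthetic, and the learning rate and momentum values are chosen for the toy problem only.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_step(w, velocity, x, y_onehot, lr=0.1, momentum=0.9):
    """One momentum-SGD step on a linear softmax classifier."""
    probs = softmax(x @ w)                                   # training result
    loss = -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))
    grad = x.T @ (probs - y_onehot) / x.shape[0]             # backpropagated error
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity, loss


# Synthetic stand-in data: 60 inputs of width 8 with three-class labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(60, 8))
class_ids = np.argmax(x @ rng.normal(size=(8, 3)), axis=1)
y = np.eye(3)[class_ids]                                     # correct results, one-hot

w = np.zeros((8, 3))
v = np.zeros_like(w)
losses = []
for _ in range(200):
    w, v, loss = train_step(w, v, x, y)
    losses.append(loss)
print(losses[0], losses[-1])
```

The first loss equals ln 3 because zero weights give a uniform prediction over the three classes; later values shrink as the updates reduce the error.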

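In another possible implementation, the device further includes an update unit, and the update unit is configured to:

add the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set;

train the cost comparison model according to the updated training sample set to obtain an updated cost comparison model.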

It should be noted that, when the device provided by the above embodiments implements its functions, the division into the above functional modules is merely used as an example. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiments and the method embodiments provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.

An embodiment of the present application provides a device for comparing scheduled running times of operators, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to implement, when executing the instructions, the method performed by the computer device in the above embodiments.

An embodiment of the present application provides a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code. When the computer-readable code runs in a processor, the processor executes the method performed by the computer device in the foregoing embodiments.

An embodiment of the present application provides a non-volatile computer-readable storage medium, on which computer program instructions are stored. When the computer program instructions are executed by a processor, the methods performed by the computer device in the foregoing embodiments are implemented.

A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. A computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random-access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital video disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.

Computer-readable program instructions or code described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.

Computer program instructions for performing the operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the “C” language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can execute the computer-readable program instructions, thereby implementing various aspects of the present application.

Aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions can also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions that implement aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

It is also possible to load computer-readable program instructions into a computer, other programmable data processing device, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing device, or other equipment to produce a computer-implemented process, so that instructions executed on computers, other programmable data processing devices, or other devices implement the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the figures show the architecture, functions and operations of possible implementations of apparatuses, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which includes one or more executable instructions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.

It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented with hardware (such as a circuit or an application-specific integrated circuit (ASIC)), or with a combination of hardware and software, such as firmware.

Although the present application has been described herein in conjunction with various embodiments, those skilled in the art can, in the process of implementing the claimed application, understand and implement other variations of the disclosed embodiments. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that these measures cannot be combined to advantage.

Various embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and alterations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

A method for comparing scheduled running times of operators, characterized in that the method comprises:

obtaining at least two candidate schedules corresponding to a target calculation expression, wherein the target calculation expression is used to describe the calculation logic of an operator, and each candidate schedule is executable code of the operator on a target hardware platform generated based on the target calculation expression;

obtaining a cost comparison model, wherein the cost comparison model is a model obtained by training a neural network using multiple sample schedules;

invoking the cost comparison model according to the at least two candidate schedules to output a cost comparison result, wherein the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules on the target hardware platform.

The method according to claim 1, wherein the invoking the cost comparison model according to the at least two candidate schedules to output a cost comparison result comprises:

preprocessing the at least two candidate schedules to obtain at least two preprocessed candidate schedules;

inputting the at least two preprocessed candidate schedules into the cost comparison model, and outputting the cost comparison result;

wherein the cost comparison model is trained according to at least one sample data group, and each sample data group includes at least two sample schedules corresponding to a sample calculation expression and a pre-marked correct cost comparison result.

The method according to claim 2, wherein the preprocessing the at least two candidate schedules to obtain at least two preprocessed candidate schedules comprises:

for each candidate schedule of the at least two candidate schedules, performing feature extraction on the candidate schedule to obtain a feature matrix;

normalizing the feature matrix corresponding to the candidate schedule to obtain the preprocessed candidate schedule.

The method according to claim 3, wherein the feature matrix is used to indicate at least one of cycle information, input data shape information, a calculation encoding, an axis type encoding, and a data access type encoding; the cycle information includes information related to the cycle calculation logic of the candidate schedule, the input data shape information is used to describe the input data of the operator, the calculation encoding includes the encoding of the calculation instruction used in the current cycle of the candidate schedule, the axis type encoding includes the type encoding of the operation performed on an axis, and the data access type encoding includes the type encoding of the data access.

The method according to any one of claims 2 to 4, wherein the obtaining a cost comparison model further comprises:

obtaining a training sample set, wherein the training sample set includes the at least one sample data group;

for each sample data group, preprocessing the at least two sample schedules to obtain at least two preprocessed sample schedules;

inputting the at least two preprocessed sample schedules into an original parameter model to obtain a training result, wherein the original parameter model is a neural network model;

comparing the training result with the correct cost comparison result to obtain a calculation loss, wherein the calculation loss is used to indicate the error between the training result and the correct cost comparison result;

training with an error backpropagation algorithm, according to the calculation loss corresponding to each of the at least one sample data group, to obtain the cost comparison model.

The method according to any one of claims 2 to 5, characterized in that, after the invoking the cost comparison model according to the at least two candidate schedules to output a cost comparison result, the method further comprises:

adding the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set;

training the cost comparison model according to the updated training sample set to obtain an updated cost comparison model.

A device for comparing scheduled running times of operators, characterized in that the device comprises:

a first acquisition unit, configured to acquire at least two candidate schedules corresponding to a target calculation expression, wherein the target calculation expression is used to describe the calculation logic of an operator, and each candidate schedule is executable code of the operator generated based on the target calculation expression;

a second acquisition unit, configured to acquire a cost comparison model, wherein the cost comparison model is a model obtained by training a neural network using multiple sample schedules;

a calling unit, configured to invoke the cost comparison model according to the at least two candidate schedules to output a cost comparison result, wherein the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules.

A device for comparing scheduled running times of operators, characterized in that the device comprises:

a processor; and

a memory for storing processor-executable instructions;

wherein the processor is configured to implement the method according to any one of claims 1-6 when executing the instructions.

A non-volatile computer-readable storage medium, on which computer program instructions are stored, wherein, when the computer program instructions are executed by a processor, the method according to any one of claims 1-6 is implemented.

A computer program product, characterized in that, when the computer program product is run on a computer, the computer executes the method according to any one of claims 1-6.