WO2023150912A1 - Operator scheduling operation time comparison method and device, and storage medium - Google Patents
Operator scheduling operation time comparison method and device, and storage medium
- Publication number
- WO2023150912A1 (application PCT/CN2022/075526)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cost comparison
- candidate
- model
- schedules
- scheduling
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Definitions
- the present application relates to the field of data processing, and in particular to a method, device and storage medium for comparing scheduled running time of operators.
- An operator is used to indicate a data processing operation.
- a neural network usually includes a convolution operator and a pooling operator.
- the convolution operator is used to indicate a convolution operation
- the pooling operator is used to indicate a pooling operation.
- the generation process of the operator's executable code is divided into two steps: calculation expression and scheduling.
- Computational expression refers to describing the computational logic of an operator through a specific language, that is, describing the tasks that the operator needs to complete, as well as the input and output of the operator, and then converting the language that describes the computational logic of the operator into an intermediate language.
- the operator's intermediate representation information (also called a template) can be obtained.
- Scheduling refers to scheduling and optimizing the intermediate representation information of operators according to the hardware characteristics of the target hardware platform. Afterwards, the scheduling-optimized intermediate representation information can be converted into executable code recognizable by the target hardware platform.
- Automatic operator optimization is an important function of optimization tools and compilers.
- the difficulty of automatic operator optimization is that it must search for the optimal schedule implementation for a specific hardware platform within a scheduling space formed by a massive number of schedules. Evaluating the execution time of the different schedules of an operator in the neural network on the hardware platform is therefore critical to the success of the optimization.
- a pre-trained cost model can be used to evaluate the absolute execution time of scheduling, so as to realize the evaluation of scheduling running cost.
- the error between the predicted absolute execution time and the real execution time is relatively large, and professionals are required to build a dedicated cost model for a specific hardware platform, which often requires a large amount of training data and a complex model structure.
- due to the relatively large prediction error of this method, the uncertainty of the cost comparison between schedules with similar predicted values cannot be eliminated.
- a scheduling running time comparison method, device and storage medium of operators are proposed.
- the embodiment of the present application provides a scheduling running time comparison method, device, and storage medium of an operator.
- the relative size of the execution time of different schedules is compared directly, which realizes the automatic tuning function of the compiler/automatic optimizer and greatly improves the evaluation speed and accuracy of the scheduling running cost.
- the embodiment of the present application provides a method for comparing the scheduled running time of operators, the method including:
- the target calculation expression is used to describe the calculation logic of the operator, and a candidate schedule is executable code of the operator on the target hardware platform, generated based on the target calculation expression;
- the cost comparison model is a model obtained by training a neural network using multiple sample scheduling
- the cost comparison model is invoked according to the at least two candidate schedules to output a cost comparison result, and the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules on the target hardware platform.
- because the cost comparison result indicates the ordering of the execution durations, it can realize the automatic tuning function of the compiler/automatic optimizer and greatly improve the evaluation speed and accuracy of the scheduling running cost.
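As an illustration of how a pairwise cost comparison result could drive candidate selection, the following minimal sketch orders candidates using only a comparison function. The function `toy_compare` and the `latency_ms` field are hypothetical stand-ins for the cost comparison model and are not part of the application; a learned comparator is not guaranteed to be transitive, so a real optimizer might need a more robust ranking scheme.

```python
from functools import cmp_to_key

def order_candidates(candidates, compare):
    """Sort candidate schedules from fastest to slowest using pairwise results.

    `compare(a, b)` stands in for the cost comparison model: negative if `a`
    is predicted to run faster than `b`, positive if slower, 0 if equal.
    """
    return sorted(candidates, key=cmp_to_key(compare))

def toy_compare(a, b):
    # Hypothetical comparator: looks at a hidden "true" latency instead of
    # calling a trained model.
    return (a["latency_ms"] > b["latency_ms"]) - (a["latency_ms"] < b["latency_ms"])

candidates = [
    {"name": "s0", "latency_ms": 4.2},
    {"name": "s1", "latency_ms": 1.7},
    {"name": "s2", "latency_ms": 3.1},
]
ranking = [c["name"] for c in order_candidates(candidates, toy_compare)]
print(ranking)
```

The first element of `ranking` is the candidate that would be kept as the preferred schedule.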
- calling the cost comparison model to output the cost comparison result includes:
- the cost comparison model is trained according to at least one set of sample data sets, and each set of sample data sets includes: at least two sample schedules corresponding to sample calculation expressions and pre-marked correct cost comparison results.
- the cost comparison model is trained according to at least one set of sample data sets, the high accuracy of the cost comparison model is guaranteed, and the accuracy of the cost comparison result obtained through the output of the cost comparison model is further guaranteed.
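A sample data group as described above pairs sample schedules with a pre-marked correct comparison result. The sketch below shows one plausible way to derive such labels from measured execution times; the function name `make_sample_groups` and the 0/1 label convention are assumptions made for illustration only.

```python
from itertools import combinations

def make_sample_groups(measured):
    """Build (schedule_a, schedule_b, label) groups from measured run times.

    `measured` maps a schedule identifier to its measured execution time.
    label = 1 means schedule_a ran faster than schedule_b, 0 otherwise
    (an assumed encoding of the "correct cost comparison result").
    """
    groups = []
    for (a, time_a), (b, time_b) in combinations(measured.items(), 2):
        groups.append((a, b, 1 if time_a < time_b else 0))
    return groups

measured = {"s0": 4.2, "s1": 1.7, "s2": 3.1}
groups = make_sample_groups(measured)
print(groups)
```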
- preprocessing the at least two candidate schedules to obtain at least two preprocessed candidate schedules includes:
- the feature matrix is used to indicate at least one of cycle (loop) information, input data shape information, calculation code, axis type code, and data access type code
- the cycle information includes information related to the loop calculation logic of the candidate schedule
- the input data shape information is used to describe the input data of the operator
- the calculation code includes the code of the calculation instruction used in the current loop of the candidate schedule
- the axis type code includes the type code of operations on the axis
- the data access type code includes the type code of data accesses.
- the candidate schedule is converted into a feature matrix, which may include at least one of five types of information: cycle information, input data shape information, calculation code, axis type code, and data access type code. The feature matrix with this data structure is used as the input data of the cost comparison model, which further improves the accuracy of the cost comparison result subsequently output by the model.
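The five kinds of information listed above can be pictured as columns of a matrix with one row per loop level. The following is only a sketch under assumed encodings: the vocabularies `COMPUTE_OPS`, `AXIS_TYPES` and `ACCESS_TYPES`, the one-hot layout, and the dictionary format of a loop are all hypothetical and are not taken from the application.

```python
COMPUTE_OPS = ["add", "mul", "fma"]                 # hypothetical calculation codes
AXIS_TYPES = ["serial", "parallel", "vectorized"]   # hypothetical axis type codes
ACCESS_TYPES = ["read", "write", "read_write"]      # hypothetical access type codes

def one_hot(value, vocabulary):
    """Encode `value` as a one-hot list over `vocabulary`."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def schedule_to_feature_matrix(loops, input_shape):
    """Return one row per loop: [extent, *input_shape, op, axis, access]."""
    matrix = []
    for loop in loops:
        row = [float(loop["extent"])]                 # cycle (loop) information
        row += [float(d) for d in input_shape]        # input data shape
        row += one_hot(loop["op"], COMPUTE_OPS)       # calculation code
        row += one_hot(loop["axis"], AXIS_TYPES)      # axis type code
        row += one_hot(loop["access"], ACCESS_TYPES)  # data access type code
        matrix.append(row)
    return matrix

loops = [
    {"extent": 64, "op": "fma", "axis": "parallel", "access": "read"},
    {"extent": 16, "op": "fma", "axis": "vectorized", "access": "read_write"},
]
features = schedule_to_feature_matrix(loops, input_shape=(1024,))
print(features)
```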
- acquiring the cost comparison model further includes:
- the training sample set includes at least one set of the sample data set
- preprocessing at least two sample schedules is performed to obtain the preprocessed at least two sample schedules
- the original parameter model is a neural network model
- the cost comparison model is obtained through training with an error back propagation algorithm.
- a training sample set is obtained, the training sample set including at least one sample data group; for each sample data group, the at least two sample schedules are preprocessed to obtain at least two preprocessed sample schedules; the preprocessed sample schedules are input into the original parameter model, which is a neural network model, to obtain a training result; the training result is compared with the correct cost comparison result to obtain a calculation loss, which indicates the error between the training result and the correct cost comparison result; and, according to the calculation losses corresponding to the at least one sample data group, the cost comparison model is trained with an error back propagation algorithm. This yields a pre-trained cost comparison model for evaluating the scheduling running cost of operators and ensures the feasibility of subsequently calling the model to implement the operator scheduling running time comparison method.
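The training steps above (forward pass, loss against the labeled comparison result, gradient-based update) can be sketched with a deliberately tiny model. This example swaps the neural network for a single linear layer with a sigmoid output applied to the feature difference of two schedules; the toy cost function, the learning rate and all names are assumptions for illustration, not the architecture of the application.

```python
import math
import random
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(w, fa, fb):
    """Linear score on the feature difference of schedules a and b."""
    return sum(wi * (a - b) for wi, a, b in zip(w, fa, fb))

def train_comparator(groups, n_features, epochs=200, lr=0.5):
    """Stochastic gradient descent on binary cross-entropy loss."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for fa, fb, label in groups:
            p = sigmoid(logit(w, fa, fb))   # predicted P(a is faster than b)
            grad_logit = p - label          # d(loss)/d(logit) for cross-entropy
            w = [wi - lr * grad_logit * (a - b) for wi, a, b in zip(w, fa, fb)]
    return w

def predict_a_faster(w, fa, fb):
    return logit(w, fa, fb) > 0.0

def toy_cost(f):
    # Hypothetical ground-truth cost used only to label the toy data.
    return 3.0 * f[0] + f[1]

rng = random.Random(0)
schedules = [[rng.random(), rng.random()] for _ in range(20)]
groups = [(fa, fb, 1 if toy_cost(fa) < toy_cost(fb) else 0)
          for fa, fb in combinations(schedules, 2)]

w = train_comparator(groups, n_features=2)
correct = sum(predict_a_faster(w, fa, fb) == bool(label) for fa, fb, label in groups)
accuracy = correct / len(groups)
print(round(accuracy, 3))
```

With a multi-layer network, the same loss gradient would be propagated backwards through every layer rather than applied to a single weight vector.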
- the method further includes:
- the cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model.
- an updated training sample set is obtained by adding the at least two candidate schedules and the cost comparison result to the training sample set; the cost comparison model is then trained on the updated training sample set to obtain an updated cost comparison model, so that the cost comparison model is updated in time and its accuracy is continuously improved.
- the embodiment of the present application provides an operator scheduling runtime comparison device, the device includes:
- a first acquisition unit configured to acquire at least two candidate schedules corresponding to a target calculation expression, where the target calculation expression is used to describe the calculation logic of an operator, and a candidate schedule is executable code of the operator generated based on the target calculation expression;
- the second acquisition unit is used to acquire a cost comparison model, and the cost comparison model is a model obtained by training a neural network by adopting multiple sample scheduling;
- the calling unit is configured to call the output of the cost comparison model according to the at least two candidate schedules to obtain a cost comparison result, and the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules.
- the calling unit is also used for:
- the cost comparison model is trained according to at least one set of sample data sets, and each set of sample data sets includes: at least two sample schedules corresponding to sample calculation expressions and pre-marked correct cost comparison results.
- the calling unit is also used for:
- the feature matrix is used to indicate at least one of cycle (loop) information, input data shape information, calculation code, axis type code, and data access type code
- the cycle information includes information related to the loop calculation logic of the candidate schedule
- the input data shape information is used to describe the input data of the operator
- the calculation code includes the code of the calculation instruction used in the current loop of the candidate schedule
- the axis type code includes the type code of operations on the axis
- the data access type code includes the type code of data accesses.
- the device further includes a training unit; the training unit is used for:
- the training sample set includes at least one set of the sample data set
- preprocessing at least two sample schedules is performed to obtain the preprocessed at least two sample schedules
- the original parameter model is a neural network model
- the cost comparison model is obtained through training with an error back propagation algorithm.
- the device further includes an update unit; the update unit is configured to:
- the cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model.
- the embodiment of the present application provides an operator scheduling runtime comparison device, the device includes:
- memory for storing processor-executable instructions
- the processor is configured to implement the above method when executing the instructions.
- the embodiment of the present application provides a non-volatile computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor.
- an embodiment of the present application provides a computer program product, and when the computer program product is run on a computer, the computer executes the above-mentioned method.
- Fig. 1 shows a schematic diagram of a generation process of a scheduling space in the related art.
- Fig. 2 shows a schematic diagram of the principles of the actual measurement method and the cost model method in the related art.
- Fig. 3 shows a schematic structural diagram of a computer device provided by an exemplary embodiment of the present application.
- Fig. 4 shows a flowchart of a method for comparing scheduled running time of operators provided by an exemplary embodiment of the present application.
- Fig. 5 shows a schematic diagram of the principle of a method for comparing scheduled running time of operators provided by an exemplary embodiment of the present application.
- Fig. 6 shows a flow chart of the training process of the cost comparison model provided by an exemplary embodiment of the present application.
- Fig. 7 shows a schematic diagram of a training process of a cost comparison model provided by an exemplary embodiment of the present application.
- Fig. 8 shows a schematic diagram of an input-output curve of a normalization function provided by an exemplary embodiment of the present application.
- Fig. 9 shows a schematic diagram of a network structure of a multi-layer perceptron architecture provided by an exemplary embodiment of the present application.
- Fig. 10 shows a flowchart of a method for comparing scheduled running time of operators provided by another exemplary embodiment of the present application.
- Fig. 11 shows a schematic diagram of a data structure of a feature matrix provided by an exemplary embodiment of the present application.
- Fig. 12 shows a schematic diagram of the application process of the cost comparison model provided by another exemplary embodiment of the present application.
- Fig. 13 shows a block diagram of an apparatus for comparing scheduled runtimes of operators provided by an exemplary embodiment of the present application.
- Deep learning technology builds a deep learning model and iteratively fits it to a large amount of historical data (model training), so that the model establishes a mapping between input and output and can thereby predict results for new input data (model inference).
- the deep learning model contains a large number of operators, such as: convolution operator, fully connected operator, pooling operator, etc.
- the whole formed by the stacking and connection of different operators constitutes a deep learning model, also known as a neural network model.
- the topology of the neural network is called the neural network architecture; the parameters of the operators contained in the neural network are model parameters.
- the specific hardware platform may be a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), or a neural network processing unit (Neural network Processing Unit, NPU).
- Scheduling: there are many ways to implement the calculation expression of an operator, and each such implementation is called a schedule.
- the performance difference of different scheduling on a specific hardware platform can be very large. From a large number of scheduling implementations, using a compiler/automatic optimizer to automatically search for the optimal scheduling for specific hardware can optimize deep learning applications, thereby reducing computing power requirements and increasing system throughput.
- there may be an intermediate expression between calculation expression and scheduling which is called a template.
- Computation expressions can form multiple templates, and each template can generate multiple schedules.
- Automatic optimization of operators is an important function of optimization tools and compilers.
- the performance of operators after automatic optimization determines whether deep learning models can be efficiently applied and meet product requirements.
- the difficulty of automatic operator optimization is that it must search for the optimal schedule implementation for a specific hardware platform within a scheduling space formed by a massive number of schedules. Evaluating the execution time of the different schedules of an operator in the neural network on the hardware platform is therefore critical to the success of the optimization.
- the "cost" mentioned in this document refers to the execution time of a schedule on the hardware platform. To evaluate the execution time of a schedule on a hardware platform, there are currently two main methods: the actual measurement method and the cost model method.
- the actual measurement method refers to the code generation of each schedule, the code compilation, and then running on the hardware.
- the specific execution time is obtained by measuring the running time, which requires a complete compilation process. Its disadvantage is that evaluating a single schedule takes a long time (on the order of seconds or more), which is too slow for real scheduling spaces containing hundreds of thousands or millions of schedules; limited by the search time, it is difficult to explore a larger scheduling space.
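The actual measurement method can be illustrated by timing two functionally equivalent implementations directly. This is only a sketch: the two Python functions stand in for compiled schedules, the helper `measure` is hypothetical, and real measurement would also include code generation and compilation, which dominate the cost described above.

```python
import time

def measure(fn, *args, repeats=5):
    """Return the best wall-clock time of `fn(*args)` over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def schedule_flat(data):
    """Single loop over all elements."""
    total = 0
    for x in data:
        total += x
    return total

def schedule_tiled(data, tile=256):
    """Same result, computed tile by tile."""
    total = 0
    for start in range(0, len(data), tile):
        total += sum(data[start:start + tile])
    return total

data = list(range(100_000))
timings = {
    "flat": measure(schedule_flat, data),
    "tiled": measure(schedule_tiled, data),
}
fastest = min(timings, key=timings.get)
print(fastest, timings)
```

Both functions return the same value for the same input, which mirrors the requirement that every schedule be equivalent to the calculation expression.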
- the cost model method refers to evaluating the execution time of scheduling by establishing a cost model. Because this method does not need to go through the process of compiling, running and measuring, it has a very obvious advantage in terms of time-consuming evaluation.
- the methods based on a cost model all predict the absolute execution time of a schedule in order to evaluate the scheduling running cost.
- the error between the predicted absolute execution time and the real execution time is relatively large, and professionals are required to build a dedicated cost model for a specific hardware platform, which often requires a large amount of training data and a complex model structure.
- the uncertainty of cost comparison between schedules with similar predicted values cannot be eliminated.
- the above shortcomings limit the application of the cost model method in the related art in the actual optimization process.
- the embodiment of the present application provides a scheduling running time comparison method, device, and storage medium of an operator.
- the relative size of the execution time of different schedules is compared directly, which realizes the automatic tuning function of the compiler/automatic optimizer and greatly improves the evaluation speed and accuracy of the scheduling running cost.
- the operator scheduling runtime comparison method provided by the embodiment of this application has strong advantages in speed and accuracy, improves the performance of the operator optimizer and significantly reduces the evaluation time.
- Computation expression refers to the whole composed of operator input data, output data and calculation logic.
- a calculation expression is an instance that describes a specific calculation process.
- the calculation expression can be user-defined, and it contains all the information needed to complete the calculation logic required by the user.
- a calculation expression is usually written as pseudocode or a structured flowchart, which is easy to write but not optimized.
- Template: calculation expressions can be transformed into templates through a series of equivalent transformations.
- the template is the intermediate representation information between the calculation expression and the scheduling during the optimization process of the calculation expression structure. Generally speaking, the template determines the order of calculation execution in the calculation expression logic and the mode of data access.
- the template changes the calculation execution order and data access mode of the calculation expression, but does not restrict how the input data of the calculation expression is divided. For example, after a loop is transformed by axis splitting, a single loop can be divided into several sub-loops, and splits into different numbers of sub-loops are different templates. In each template, the upper and lower bounds of the sub-loops only need to be equivalent to the calculation expression as a whole; the specific bounds of each sub-loop are left undetermined.
- Schedule: according to the hardware characteristics of the target hardware platform, the intermediate representation information of the operator is scheduled and optimized. A schedule fixes the concrete value of every variable parameter in the template, so it can be converted into a software implementation of the calculation expression. For the same input data, the output data of the schedule is exactly the same as the output data of the calculation expression, but the calculation process can differ.
- Feature embedding: the intermediate output of the input data after passing through a neural network module.
- Feature embedding is the mapping of the neural network module to the input data in another space, including the extraction, enhancement and encoding of the input data.
- Multi-layer perceptron: a basic unit of a neural network, composed of fully connected layers, activation layers, and so on. Multi-layer perceptrons can form an entire neural network architecture, or they can appear as modules within part of a larger architecture.
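The definition of a multi-layer perceptron can be made concrete with a forward pass written in plain Python. This is a minimal sketch with hand-picked weights and a ReLU activation chosen for illustration; the layer sizes and values are assumptions and say nothing about the network used in the application.

```python
def relu(values):
    """Activation layer: element-wise max(0, x)."""
    return [max(0.0, v) for v in values]

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mlp_forward(x, layers):
    """Apply dense layers in order, with ReLU between them (not after the last)."""
    for index, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias)
        if index < len(layers) - 1:
            x = relu(x)
    return x

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # 2 inputs -> 2 hidden units
    ([[1.0, 2.0]], [0.5]),                     # 2 hidden units -> 1 output
]
output = mlp_forward([2.0, 1.0], layers)
print(output)
```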
- the generation process of the scheduling space is shown in Figure 1.
- the computer device (such as an automatic optimizer) obtains the calculation expression input by the user and transforms it to generate a template space; the templates in the template space can in turn be transformed into schedule implementations whose logic is equivalent to the calculation expression.
- the set of valid schedules forms the schedule space.
- the computer device searches in the scheduling space and outputs the optimal scheduling realization.
- the calculation expression is a user-defined loop calculation, and the calculation logic in the loop body is represented by a statement (stmt).
- the upper and lower bounds of the loop are from 0 to 546756.
- the single loop of the calculation expression can be equivalently transformed into templates with a double nested loop, a triple nested loop, and so on up to an N-fold nested loop.
- the upper and lower bounds of each nested loop are not determined. Axis splitting allows different plans for the data access pattern.
- once these bounds are fixed, the resulting code, whose logic is equivalent to the calculation expression, is a schedule.
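The relationship between the single loop, a two-level template and a concrete schedule can be sketched as follows. The parameter `inner` plays the role of the undetermined loop bound in the template; choosing a value for it yields one schedule. The function names and the bounds guard are illustrative assumptions, and a small loop length is used instead of 546756 to keep the example fast.

```python
def single_loop(n, stmt):
    """The calculation expression: one loop from 0 to n."""
    for i in range(n):
        stmt(i)

def split_loop(n, inner, stmt):
    """Two-level template; fixing `inner` turns it into a concrete schedule."""
    outer = (n + inner - 1) // inner   # ceiling division
    for io in range(outer):
        for ii in range(inner):
            i = io * inner + ii
            if i < n:                  # guard for when `inner` does not divide n
                stmt(i)

seen_single, seen_split = [], []
single_loop(1000, seen_single.append)
split_loop(1000, 32, seen_split.append)
print(seen_single == seen_split)
```

Each distinct value of `inner` visits the same indices in the same order here, but on real hardware it changes how data is accessed, which is why execution times differ between schedules.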
- the calculation expression composed of complex calculation logic can usually derive tens of millions of schedule implementations, and the execution time of different schedules on the target hardware platform can vary by hundreds or thousands of times.
- the automatic optimization system uses a series of operations to search the space formed by the massive number of schedules for the schedule that executes best on the hardware, thereby optimizing the operator.
- the actual measurement method generates legal code according to the definition of scheduling, compiles it with a compiler, executes and measures it on the hardware, and obtains performance evaluation results.
- the result is usually the execution time of the schedule, and can also be a count of hardware clock cycles required to run.
- the scheduling with the shortest execution time (smallest hardware clock cycle count) is finally selected.
- the execution time obtained in this way is the most accurate and true.
- the disadvantage is that the code generation and compilation process usually takes several seconds to several minutes to complete. The time consumption of the operation and measurement process depends on the calculation amount and complexity of the operator. The optimization process is very slow.
- the search and selection method based on a machine learning model can greatly accelerate the above process, replacing code generation, compilation and running, which take seconds, with neural network inference, which takes milliseconds. At the same time, because the accuracy of model prediction is limited, the quality of the selected optimum may decline.
- the current cost model is as described above. After feature extraction is performed on the schedule, the absolute execution time or number of running cycles of the schedule is predicted by calling the cost model.
- the accuracy of the cost model in this method is low: the cost model in the related art has an average error of 16% when predicting schedule execution time, while the actual execution time difference between a considerable number of schedules is smaller than this 16% error. Apart from the error, the cost of obtaining the cost model is high: training requires 1.8 million training samples, the network architecture is complex, and the training convergence time is long. Therefore, this method cannot achieve fast and accurate operator search optimization well.
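The effect of the prediction error on ranking can be illustrated numerically. The sketch below makes a simplifying assumption that is not in the source: it treats the 16% average error as a symmetric bound around each prediction, and reports whether two predictions can still be ordered with certainty. The function name is hypothetical.

```python
def ranking_is_certain(pred_a, pred_b, rel_error=0.16):
    """True if the error intervals around two predicted times do not overlap."""
    lo_a, hi_a = pred_a * (1 - rel_error), pred_a * (1 + rel_error)
    lo_b, hi_b = pred_b * (1 - rel_error), pred_b * (1 + rel_error)
    return hi_a < lo_b or hi_b < lo_a

# Predictions 10% apart: intervals overlap, so the order is uncertain.
close_pair = ranking_is_certain(100.0, 110.0)
# Predictions 2x apart: intervals are disjoint, so the order is reliable.
far_pair = ranking_is_certain(100.0, 200.0)
print(close_pair, far_pair)
```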
- the embodiment of the present application provides a new cost model: a cost comparison model.
- the cost comparison model avoids predicting the schedule execution time directly, and transforms the regression problem into a classification problem that is easier for a neural network to learn.
- the cost comparison model takes at least two candidate schedules as input, and the output result is a cost comparison result, and the cost comparison result is used to indicate the order of the execution time of the at least two candidate schedules on the target hardware platform.
- the method provided by the embodiment of the present application has the advantages of high accuracy, fast inference speed, and lower training cost than existing methods. In the operator optimization process, the cost comparison model provided by the embodiment of the present application can quickly compare the execution time of different schedules, thereby realizing large-scale search optimization of operators.
- the operator scheduling runtime comparison method provided in the embodiment of the present application can be applied to the optimization process of the automatic operator optimization system.
- the core content of the embodiment of the present application is the cost comparison model, including the model architecture, model training process and model application process of the cost comparison model.
- the operator scheduling runtime comparison method provided in the embodiment of the present application can be applied to a specific computer device (such as a CPU, GPU or NPU) to perform large-scale comparison and search over multiple candidate schedule implementations of the target calculation expression, so as to obtain the optimal schedule and thereby optimize the target calculation expression on that computer device.
- the execution subject of the method for comparing the scheduled running time of operators provided in the embodiment of the present application is a computer device, which may be a general-purpose computer device or a special-purpose computing device. Please refer to FIG. 3 , which shows a schematic structural diagram of a computer device provided by an exemplary embodiment of the present application.
- the computer device may be a terminal or a server.
- Terminals include tablet computers, laptop computers, and desktop computers, among others.
- the server can be one server, or a server cluster composed of several servers, or a cloud computing service center.
- the computer device includes a processor 10 , a memory 20 and a communication interface 30 .
- The structure shown in FIG. 1 does not constitute a limitation on the computer device, which may include more or fewer components than illustrated, combine some components, or arrange the components differently. Wherein:
- the processor 10 is the control center of the computer equipment, and uses various interfaces and lines to connect various parts of the entire computer equipment, by running or executing software programs and/or modules stored in the memory 20, and calling data stored in the memory 20 , to perform various functions of the computer equipment and process data, thereby controlling the computer equipment as a whole.
- the processor 10 may be implemented by a CPU, or may be implemented by a GPU.
- the memory 20 can be used to store software programs as well as modules.
- the processor 10 executes various functional applications and data processing by executing software programs and modules stored in the memory 20 .
- The memory 20 can mainly include a program storage area and a data storage area. The program storage area can store an operating system 21, a first acquisition unit 22, a second acquisition unit 23, a detection unit 24, and an application program 25 required by at least one function (such as neural network training); the data storage area can store data created through use of the computer device.
- Memory 20 can be realized by any type of volatile or nonvolatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
- the memory 20 may also include a memory controller to provide the processor 10 with access to the memory 20 .
- The processor 10 performs the following function by running the first acquisition unit 22: acquiring at least two candidate schedules corresponding to the target calculation expression, where the target calculation expression describes the calculation logic of the operator and each candidate schedule is executable code of the operator generated based on the target calculation expression. The processor 10 performs the following function through the second acquisition unit 23: acquiring a cost comparison model, which is a model obtained by training a neural network using multiple sample schedules. The processor 10 performs the following function through the calling unit 24: invoking the cost comparison model on the at least two candidate schedules to output a cost comparison result, which indicates the ordering of the execution durations of the at least two candidate schedules.
- The computer device obtains the calculation expression code input by the user, that is, the target calculation expression, analyzes it through the operator optimization system, generates a template space based on optimization rules or a polyhedral model, and instantiates the templates to generate a large number of legal candidate schedules. The generated candidate schedules form a scheduling space.
- An instance in the scheduling space represents a legal scheduling.
- The cost comparison model provided by the embodiment of this application serves as an evaluation module: it compares at least two input candidate schedules and outputs a cost comparison result, so as to achieve the goal of searching the scheduling space for the optimal schedule.
- FIG. 4 shows a flowchart of a method for comparing scheduled running time of operators provided by an exemplary embodiment of the present application. This embodiment is described by taking the method for comparing the scheduling running time of the operator applied to the computer device shown in FIG. 3 as an example.
- The method for comparing the scheduling running time of operators includes the following steps:
- Step 401: Acquire at least two candidate schedules corresponding to the target computing expression.
- the target computing expression is used to describe the computing logic of the operator.
- the candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform.
- the computer device acquires at least two candidate schedules from the schedule space corresponding to the target computation expression.
- the computer device obtains the input target calculation expression, analyzes the target calculation expression, generates a template space according to a preset method, generates multiple candidate schedules by instantiating the template, and the generated multiple candidate schedules constitute a scheduling space.
- the computer device obtains at least two candidate schedules from the schedule space.
- the preset method is a dynamic programming method or an optimization rule method or a polyhedron model method.
- the preset methods may also be different.
- The embodiment of the present application does not limit the algorithm used to generate the scheduling space. The embodiment applies as long as the scheduling space contains at least two candidate schedules to compare.
- the target calculation expression is a specific calculation expression, for example, the target calculation expression is an input calculation expression.
- The candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform.
- the target hardware platform is CPU, GPU or NPU. This embodiment of the present application does not limit it.
- When the computer device receives a preset acquisition instruction, it acquires at least two candidate schedules corresponding to the target computing expression. Alternatively, the computer device acquires at least two candidate schedules corresponding to the target computing expression at every preset time interval, or acquires them in real time.
- the preset time interval is a default setting or a custom setting, which is not limited in this embodiment.
- Step 402: Acquire a cost comparison model, which is a model obtained by training a neural network using multiple sample schedules.
- the computer device obtains the trained cost comparison model.
- the terminal obtains a trained cost comparison model stored by itself, or obtains a trained cost comparison model from a server.
- the server obtains a trained cost comparison model stored in itself.
- The cost comparison model is a model obtained by training the neural network using at least two sample schedules and the correct cost comparison result. That is, the cost comparison model is determined according to at least two sample schedules and the correct cost comparison result.
- the correct cost comparison result is a pre-marked correct cost comparison result corresponding to at least two sample schedules.
- the neural network of the cost comparison model can adopt an end-to-end stacked multi-layer perceptron architecture.
- Other reasonable variant architectures can also fit the function of the cost comparison model, and different architectures affect the final accuracy of the model.
- Any network architecture formed by varying this architecture, deriving from it, or replacing its layers should be regarded as equivalent to the neural network described in the embodiments of this application.
- the neural network is a deep neural network (Deep Neural Network, DNN).
- the neural network is a Convolutional Neural Network (CNN).
- the neural network is a Recurrent Neural Network (RNN). This embodiment of the present application does not limit it.
- the cost comparison model is a neural network model that identifies relative execution times of at least two candidate schedules on the target hardware platform.
- the cost comparison model is used to convert the input at least two candidate schedules into cost comparison results.
- the cost comparison result is used to indicate the ranking of the execution durations of the at least two candidate schedules on the target hardware platform.
- the cost comparison model is used to represent the correlation between at least two candidate schedules and the cost comparison results.
- the cost comparison model is a preset mathematical model, and the cost comparison model includes model coefficients between at least two candidate schedules and cost comparison results.
- the model coefficient can be a fixed value, or a value that is dynamically modified over time, or a value that is dynamically modified according to a usage scenario.
- Step 403: Invoke the cost comparison model on the at least two candidate schedules to output a cost comparison result, which indicates the ordering of the execution durations of the at least two candidate schedules on the target hardware platform.
- The computer device preprocesses the at least two candidate schedules to obtain at least two preprocessed candidate schedules, inputs the preprocessed candidate schedules into the cost comparison model, and outputs the cost comparison result.
- the cost comparison result is used to indicate the order of the execution durations of at least two candidate schedules on the target hardware platform. That is, the cost comparison result does not indicate the absolute execution time of the at least two candidate schedules on the target hardware platform, but indicates the relative size of the execution time of the at least two candidate schedules on the target hardware platform.
- the cost comparison result is coding information of a comparison result of the predicted execution durations of at least two candidate schedules.
- the computer device decodes the coded information output by the cost comparison model, and obtains the order of the execution durations of at least two candidate schedules, that is, the comparison result.
- the cost comparison result includes encoding information, and the value of the encoding information is in one-to-one correspondence with the execution duration comparison results of at least two candidate schedules.
- When the encoded information is the first value, it indicates that the execution duration of the first candidate schedule is less than that of the second candidate schedule; when the encoded information is the second value, it indicates that the execution duration of the first candidate schedule is equal to that of the second candidate schedule; when the encoded information is the third value, it indicates that the execution duration of the first candidate schedule is greater than that of the second candidate schedule. The first, second, and third values are different.
- the computer device selects the candidate schedule with the shortest execution time among the at least two candidate schedules as the target schedule according to the cost comparison results of the at least two candidate schedules, retains the target schedule, and discards other candidate schedules except the target schedule.
- Alternatively, for example when the cost comparison result indicates equal execution durations, the computer device takes any one of the at least two candidate schedules as the target schedule, retains the target schedule, and discards the other candidate schedules.
- the embodiment of the present application does not limit the method of retaining and discarding the scheduling.
- The computer device obtains the input target computing expression, analyzes it, generates a template space according to a preset method, and generates multiple candidate schedules by instantiating the template; the generated candidate schedules constitute a scheduling space.
- Two candidate schedules from the scheduling space, schedule A and schedule B, are preprocessed to obtain the preprocessed schedule A and schedule B. The preprocessed schedule A and schedule B are input into the cost comparison model, which outputs encoded information, and the encoded information is decoded to obtain the cost comparison result of schedule A and schedule B.
- When the encoding information is 001, it indicates that the execution duration of schedule A is less than that of schedule B; schedule A is retained and schedule B is discarded. When the encoding information is 010, it indicates that the execution duration of schedule A is equal to that of schedule B; either schedule A or schedule B is retained. When the encoding information is 100, it indicates that the execution duration of schedule A is greater than that of schedule B; schedule B is retained and schedule A is discarded.
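- The decoding and retain/discard rule above can be sketched as follows. This is a minimal illustration using the one-hot codes that appear in this description (001, 010, 100); the names `Order`, `decode`, and `select` are hypothetical, and keeping schedule A on a tie is an arbitrary choice, since the embodiment allows either schedule to be kept.

```python
from enum import Enum


class Order(Enum):
    """Comparison results, keyed by the encoded information."""
    LESS = "001"     # schedule A runs for less time than schedule B
    EQUAL = "010"    # both schedules run for the same time
    GREATER = "100"  # schedule A runs for more time than schedule B


def decode(code: str) -> Order:
    """Map encoded information to a comparison result.

    Raises ValueError for any string that is not one of the three codes.
    """
    return Order(code)


def select(a, b, code: str):
    """Return (retained, discarded) for schedules a and b."""
    order = decode(code)
    if order is Order.GREATER:
        return b, a
    # LESS keeps A. On EQUAL either may be kept; A is kept here (assumption).
    return a, b


kept, dropped = select("schedule_A", "schedule_B", "100")
```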
- In summary, the embodiment of the present application acquires at least two candidate schedules corresponding to the target computing expression and invokes the cost comparison model on them to directly compare their relative execution durations on the target hardware platform, outputting a cost comparison result that indicates the ordering of those durations. This supports automatic tuning in a compiler or automatic optimizer and greatly improves the speed and accuracy of evaluating schedule running cost.
- Before the computer device acquires the cost comparison model, the model needs to be trained on a training sample set.
- the training process of the cost comparison model is introduced below.
- the training process of the cost comparison model includes the following steps:
- Step 601: Obtain a training sample set that includes at least one sample data group.
- The cost comparison model is trained on the at least one sample data group, and each sample data group includes at least two sample schedules corresponding to a sample calculation expression and a pre-marked correct cost comparison result.
- Step 602: For each sample data group, preprocess the at least two sample schedules to obtain at least two preprocessed sample schedules.
- For each sample data group, the computer device performs feature extraction on each of the at least two sample schedules to obtain a feature matrix, and normalizes the feature matrix corresponding to the sample schedule to obtain a preprocessed sample schedule.
- feature extraction is the process of extracting features from sample schedules and converting features into structured data.
- Step 603: Input the at least two preprocessed sample schedules into the original parameter model to obtain a training result, where the original parameter model is a neural network model.
- the original parameter model is established according to the neural network model, for example: the original parameter model is established according to the DNN model.
- For each sample data group, the computer device creates an input-output pair corresponding to that group: the input parameter of the input-output pair is the at least two sample schedules in the sample data group, and the target parameter is the correct cost comparison result in the sample data group. The computer device inputs the input parameter into the original parameter model to obtain the training result.
- input-output pairs are represented by feature vectors.
- Step 604: Compare the training result with the correct cost comparison result to obtain a calculation loss, which indicates the error between the training result and the correct cost comparison result.
- the training result is the coding information output by the original parameter model, and the correct cost comparison result is the pre-marked coding information.
- the encoded information is information encoded by one-hot code (One-Hot).
- the calculation loss is represented by cross entropy.
- Step 605: Based on the calculation losses corresponding to the at least one sample data group, train with the error back propagation algorithm to obtain the cost comparison model.
- the computer device determines the gradient direction of the cost comparison model according to the calculation loss through the back propagation algorithm, and updates the model parameters in the cost comparison model layer by layer from the output layer of the cost comparison model.
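- As a hedged illustration of steps 604 and 605, the sketch below computes a cross-entropy loss between a three-way model output and a one-hot target, together with the gradient that back propagation would start from at the output layer. It assumes the model output is a vector of three raw scores passed through a softmax; the description names cross entropy and one-hot encoding but does not state the output activation, so the softmax and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, one_hot):
    """Cross entropy between softmax(logits) and a one-hot target.

    Uses the log-sum-exp form so that no log(0) can occur.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return sum(t * (log_z - x) for t, x in zip(one_hot, logits))


def grad_wrt_logits(logits, one_hot):
    """Gradient of the loss above with respect to the raw scores.

    For softmax plus cross entropy this is (probabilities - target).
    """
    return [p - t for p, t in zip(softmax(logits), one_hot)]


target = [0, 0, 1]                      # one-hot form of the code 001
loss = cross_entropy([0.0, 0.0, 0.0], target)
grad = grad_wrt_logits([0.0, 0.0, 0.0], target)
```

- With all-zero scores the model is maximally unsure, so the loss equals ln 3 and the gradient pushes the score of the correct class up and the other two down.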
- the computer equipment extracts two schedules from the schedule space, that is, schedules A and B, as input data for cost comparison model training. Compare the relative execution time of the two schedules on the target hardware platform, and use one-hot encoding to generate (A, B) input encoding information (that is, the correct cost comparison result) as the target parameter of the backpropagation algorithm.
- The encoding information is shown in Table 1.
- When the encoding information is the first value, it indicates that the execution duration of schedule A is less than that of schedule B; when the encoding information is the second value, it indicates that the execution duration of schedule A is equal to that of schedule B; when the encoding information is the third value, it indicates that the execution duration of schedule A is greater than that of schedule B.
- the computer equipment performs feature extraction on schedule A and schedule B to obtain respective corresponding feature matrices.
- the feature matrices of scheduling A and scheduling B are two 250x57-dimensional matrices.
- Part of the column data in the feature matrix is normalized to limit its dynamic range.
- the formula of the normalization function is as follows:
- v is the input data
- v* is the output data.
- Schematically, the input-output curve of the normalization function is shown in FIG. 8, where the abscissa is the input data and the ordinate is the output data.
- The computer device inputs the normalized schedule A into feature embedding module A, composed of a multi-layer perceptron, that is, DNN_A, which outputs a 1x512-dimensional schedule embedding A, and inputs the normalized schedule B into feature embedding module B, composed of a multi-layer perceptron, that is, DNN_B, which outputs a 1x512-dimensional schedule embedding B.
- The two schedule embeddings are subtracted element-wise, that is, the schedule embedding A is subtracted from the schedule embedding B to obtain a schedule difference embedding.
- the mean square error loss function (or minimum square error function) is used as the loss function to calculate the calculation loss of the model for the current input.
- The calculation loss is back propagated using gradient descent, and the parameters of the neural network modules DNN_A, DNN_B, and DNN_CLS are updated. The above steps are repeated to train on the training sample set for multiple epochs (for example, 30 epochs) until the model converges.
- The network structure of DNN_A, DNN_B, and DNN_CLS may be an end-to-end stacked multi-layer perceptron architecture, as shown in FIG. 9. In the figure, each number denotes the number of neurons in that layer, and the ReLU function is used as the activation function between fully connected layers.
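- The data flow through DNN_A, DNN_B, and DNN_CLS can be sketched in plain Python as below. This is only an illustration under stated assumptions: the layer widths of FIG. 9 are not reproduced in this text, so tiny hypothetical sizes stand in for 250x57 and 512; flattening the feature matrix before the first layer, feeding the difference embedding to DNN_CLS to produce three scores, and the sign convention of the subtraction (A minus B here) are all assumptions.

```python
import random


def make_mlp(sizes, rng):
    """Create (weights, biases) for a stack of fully connected layers."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        w = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers


def mlp_forward(layers, x):
    """Apply the layers with ReLU between layers, none after the last."""
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bj
             for row, bj in zip(w, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def compare(dnn_a, dnn_b, dnn_cls, feat_a, feat_b):
    """Return three scores, one per position of the 3-digit encoding."""
    flat_a = [v for row in feat_a for v in row]  # flattening is an assumption
    flat_b = [v for row in feat_b for v in row]
    emb_a = mlp_forward(dnn_a, flat_a)
    emb_b = mlp_forward(dnn_b, flat_b)
    diff = [a - b for a, b in zip(emb_a, emb_b)]  # element-wise difference
    return mlp_forward(dnn_cls, diff)


rng = random.Random(0)
ROWS, COLS, EMB = 4, 3, 8  # tiny stand-ins for 250, 57 and 512
dnn_a = make_mlp([ROWS * COLS, 16, EMB], rng)
dnn_b = make_mlp([ROWS * COLS, 16, EMB], rng)
dnn_cls = make_mlp([EMB, 16, 3], rng)
feat_a = [[rng.random() for _ in range(COLS)] for _ in range(ROWS)]
feat_b = [[rng.random() for _ in range(COLS)] for _ in range(ROWS)]
scores = compare(dnn_a, dnn_b, dnn_cls, feat_a, feat_b)
```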
- FIG. 10 shows a flowchart of a method for comparing scheduling runtimes of operators provided by another exemplary embodiment of the present application. This embodiment is described by taking the method for comparing the scheduling running time of the operator applied to the computer device shown in FIG. 3 as an example.
- The method for comparing the scheduling running time of operators includes the following steps:
- Step 1001: Obtain at least two candidate schedules from the scheduling space corresponding to the target computation expression.
- the target computing expression is used to describe the computing logic of the operator
- the candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform.
- the computer device acquires an input target computing expression, analyzes the target computing expression, generates a template according to a preset method, and determines a scheduling space, where the scheduling space includes at least two candidate schedulings generated by instantiating the template.
- the computer device acquires at least two candidate schedules from the schedule space corresponding to the target computation expression.
- the scheduling space includes n candidate schedulings.
- In one approach, candidates are compared pairwise and the better one is kept after each comparison, so that n-1 comparisons yield the optimal target schedule.
- Alternatively, a bisection (tournament-style) comparison is used. For example, when n is 8, the 8 schedules are divided into 4 pairs, and the cost comparison model selects the faster candidate schedule from each pair, giving 4 candidates. These are regrouped into 2 pairs, which requires 2 comparisons, and the 2 winning candidate schedules are then compared in a final comparison to obtain the optimal target schedule among the 8 candidates.
- the embodiment of the present application does not limit the scheduling group comparison method.
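- The grouped comparison described above can be sketched as a simple tournament. In this illustration the callable `faster(a, b)` is a hypothetical stand-in for a call to the cost comparison model (it returns True when a should be kept), and letting an unpaired schedule advance automatically when the count is odd is an assumption, since the description only covers n = 8.

```python
def tournament(schedules, faster):
    """Return (best schedule, number of pairwise comparisons used)."""
    if not schedules:
        raise ValueError("need at least one schedule")
    current, comparisons = list(schedules), 0
    while len(current) > 1:
        winners = []
        # Compare neighbours two at a time and keep the faster of each pair.
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            winners.append(a if faster(a, b) else b)
            comparisons += 1
        if len(current) % 2:
            winners.append(current[-1])  # unpaired schedule advances
        current = winners
    return current[0], comparisons


# Made-up execution times, used only to drive the stand-in comparator.
times = {"s0": 9.0, "s1": 4.0, "s2": 7.5, "s3": 3.2,
         "s4": 8.1, "s5": 6.0, "s6": 5.5, "s7": 4.4}
best, used = tournament(list(times), lambda a, b: times[a] <= times[b])
```

- For 8 schedules this performs 4 + 2 + 1 = 7 comparisons, the same n-1 total as keeping a running best, but the comparisons within each round are independent of one another.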
- Step 1002: For each of the at least two candidate schedules, perform feature extraction on the candidate schedule to obtain a feature matrix.
- The computer device extracts multiple types of information from each of the m loops of the candidate schedule and combines the information for each loop into a vector; the vectors together form the feature matrix corresponding to the candidate schedule, where m is a positive integer.
- the combined vector size is 1x57.
- a maximum of 250 loops of information are supported, and finally assembled into a 250x57 two-dimensional feature matrix.
- the number of supported loops can vary according to actual needs, which is not limited in this embodiment of the present application.
- the feature matrix is used to indicate at least one of cycle information, input data shape information, calculation encoding, axis type encoding, and data access type encoding.
- The loop information includes information related to the loop calculation logic of the candidate schedule.
- The loop information describes one loop level in the schedule; for example, the size of the loop information is 1x6.
- The loop information includes at least one of: loop depth, nesting level, block number, a flag indicating whether it is the last loop, the quotient of the loop depth, and the remainder of the loop depth. The loop depth and the quotient of the loop depth need to be normalized.
- the input data shape information is used to describe the input data of the operator.
- the size of the input data shape information is 1x10.
- the operator is a single-input operator, a double-input operator, or a multi-input operator.
- the shape information of the input data includes: shape information corresponding to k input data, k is a positive integer, and the shape information includes at least one of batch size, number of channels, height, width, and minimum number of channels.
- the computation encoding includes the encoding of the computation instruction used in the current cycle of the candidate schedule.
- the size of the calculation code is 1x6.
- the computing code includes: at least one of memory access types, program instructions, data types, storage units, and identifiers for indicating whether to use double buffering.
- Axis type encodings include encodings for the types of operations on the axes.
- the size of the axis type code is 1x15.
- Axis type codes are used to indicate at least one operation among extended, normalized axes.
- the data access type encoding includes the type encoding of the access to the data.
- The size of the data access type encoding is 1x20.
- the data access type code is used to indicate at least one access among write data, read data, allocation, and pragma.
- Schematically, feature extraction is performed on a candidate schedule to obtain a feature matrix whose data structure is shown in FIG. 11. Multiple types of information are extracted from each loop of the candidate schedule and combined into a vector of size 1x57, and information for up to 250 loops is supported.
- The feature matrix indicates loop information (size 1x6), input data shape information (size 1x10), calculation encoding (size 1x6), axis type encoding (size 1x15), and data access type encoding (size 1x20).
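- The layout of one feature row and of the full matrix can be checked with a short sketch. The field widths are those listed above and sum to 57; the field names, the order of the fields within a row, and zero-padding of schedules with fewer than 250 loops are assumptions made only for illustration.

```python
# (field name, width) pairs; names and ordering are hypothetical.
FIELDS = [
    ("loop_info", 6),
    ("input_shape", 10),
    ("compute_encoding", 6),
    ("axis_type", 15),
    ("data_access_type", 20),
]
ROW_WIDTH = sum(width for _, width in FIELDS)
MAX_LOOPS = 250


def field_slices():
    """Return {field name: (start, stop)} column ranges within a row."""
    slices, start = {}, 0
    for name, width in FIELDS:
        slices[name] = (start, start + width)
        start += width
    return slices


def build_feature_matrix(loop_rows):
    """Stack per-loop rows and zero-pad to MAX_LOOPS x ROW_WIDTH."""
    if len(loop_rows) > MAX_LOOPS:
        raise ValueError("too many loops for the feature matrix")
    for row in loop_rows:
        if len(row) != ROW_WIDTH:
            raise ValueError("each loop row must have ROW_WIDTH entries")
    padding = [[0.0] * ROW_WIDTH for _ in range(MAX_LOOPS - len(loop_rows))]
    return [list(row) for row in loop_rows] + padding


matrix = build_feature_matrix([[1.0] * ROW_WIDTH for _ in range(3)])
```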
- Step 1003: For each of the at least two candidate schedules, normalize the feature matrix corresponding to the candidate schedule to obtain a preprocessed candidate schedule.
- Step 1004: Input the at least two preprocessed candidate schedules into the trained cost comparison model and output the cost comparison result, which indicates the ordering of the execution durations of the at least two candidate schedules on the target hardware platform.
- The computer device acquires a trained cost comparison model, which is a model obtained by training a neural network using multiple sample schedules.
- The computer device inputs the at least two preprocessed candidate schedules into the trained cost comparison model and outputs a cost comparison result, which indicates the ordering of the execution durations of the at least two candidate schedules on the target hardware platform.
- The computer device adds the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set, and trains the cost comparison model on the updated training sample set to obtain an updated cost comparison model.
- the computer equipment extracts two schedules A and B from the schedule space, and extracts features from the schedule A and schedule B to obtain their corresponding feature matrices.
- The feature matrices of schedule A and schedule B are two 250x57-dimensional matrices. Part of the column data in each feature matrix is normalized to limit its dynamic range. The normalization is analogous to that described in the model training process above and is not repeated here.
- The computer device inputs the normalized schedule A into feature embedding module A, composed of a multi-layer perceptron, that is, DNN_A, which outputs a 1x512-dimensional schedule embedding A, and inputs the normalized schedule B into feature embedding module B, composed of a multi-layer perceptron, that is, DNN_B, which outputs a 1x512-dimensional schedule embedding B.
- The two schedule embeddings are subtracted element-wise, that is, the schedule embedding A is subtracted from the schedule embedding B to obtain the schedule difference embedding.
- For DNN_A, DNN_B, and DNN_CLS, refer by analogy to the relevant description in the model training process above; details are not repeated here.
- the computer equipment converts the outputted three-digit encoded information into a one-hot coded label format.
- In summary, the embodiment of the present application performs feature extraction on at least two candidate schedules, mapping each schedule to its unique matrix representation to obtain the feature matrices of the candidate schedules, and normalizes those feature matrices. The cost comparison model based on a deep neural network takes the at least two preprocessed feature matrices as input and outputs encoded information representing the comparison of the predicted execution durations of the candidate schedules. Decoding this encoded information yields the comparison result.
- In other words, the deep learning model compares the execution durations of different schedule implementations of the same calculation expression on a specific hardware platform, replacing the process of compiling each schedule implementation and then running and measuring it on hardware. This addresses the slow large-scale search in automatic operator optimization systems such as automatic optimizers and compilers.
- The cost comparison model is implemented with the goal of predicting how fast operators execute.
- The training sample set includes 20792 schedules from 32 operators, and each operator contains different schedules. Schedules belonging to the same operator are paired to form a set of training instances; the execution durations of the two schedules in each pair are compared, and the target of the paired training instance is generated according to the method described above.
- For example, schedule A and schedule B are extracted, where the actual execution time of schedule A is 15 seconds and that of schedule B is 17 seconds. Then (A, B) is a training instance, and since 15 seconds is less than 17 seconds, the target code of the training instance (A, B) is 001.
- The pairwise combinations of schedules belonging to the same operator may include the combination of a schedule with itself, for which the target code of the resulting training instance is 010. Combinations of schedules of the same operator are order-sensitive: the (A, B) combination differs from the (B, A) combination, and if the execution durations of A and B differ, the target codes of (A, B) and (B, A) also differ. If an operator contains N (N>2) schedules, pairwise combination yields N² training instances. With this way of combining, a relatively large training data set can be built even when the amount of raw training data is limited.
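- The construction of ordered training pairs can be sketched as follows. The function names and the sample timings are hypothetical, and exact equality of measured times is used for the 010 case only to keep the sketch simple; how ties are detected in practice is not specified in this description.

```python
from itertools import product


def target_code(time_a, time_b):
    """One-hot target for the ordered pair (A, B)."""
    if time_a < time_b:
        return "001"
    if time_a == time_b:
        return "010"
    return "100"


def build_instances(times):
    """Build all ordered pairs, self-pairs included.

    `times` maps a schedule name to its measured execution time for a
    single operator; N schedules give N * N training instances.
    """
    return [((a, b), target_code(times[a], times[b]))
            for a, b in product(times, repeat=2)]


measured = {"A": 15.0, "B": 17.0, "C": 21.5}  # seconds, illustrative values
instances = dict(build_instances(measured))
```

- Here (A, B) maps to 001 and (B, A) to 100, matching the order sensitivity described above, and the three schedules yield nine instances.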
- The neural network model is trained in batches: 5000 training instances are input per iteration, the learning rate is set to 10e-8, and stochastic gradient descent with momentum is used to train over the complete set of training instances for multiple epochs (for example, 30 epochs).
- the test set includes 46022 test instances, and each test instance is composed of two schedules belonging to the same operator. Any schedule used to generate test instances is not included in the schedule set for generating training instances.
- the test target code is generated by the above-mentioned related method for the test instance.
- The prediction output by the network is passed through the argmax function; if the result exactly matches the test target code, the test instance is recorded as correctly predicted by the network.
- Accuracy is defined as: the number of test instances correctly predicted by the network/the total number of test instances tested. Tested on 46022 test cases, the method correctly predicts 41242 test cases with an accuracy rate of 89.61%.
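The evaluation rule can be sketched as follows. This is an illustrative sketch under stated assumptions: the network is assumed to emit three scores per test instance, position i of the score vector is assumed to correspond to character i of the target code, and the sample outputs are made up. Only the final line uses the figures reported in the text.

```python
def argmax(scores):
    """Index of the largest score (the first one on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(outputs, targets):
    """Fraction of instances whose argmax one-hot code equals the target code."""
    correct = 0
    for scores, code in zip(outputs, targets):
        predicted = ["0"] * len(scores)
        predicted[argmax(scores)] = "1"
        if "".join(predicted) == code:
            correct += 1
    return correct / len(targets)


# Made-up network outputs for three test instances (illustrative only).
outputs = [[0.1, 0.2, 0.7], [0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
targets = ["001", "010", "001"]
print(accuracy(outputs, targets))

# Figures reported in the text: 41242 correct out of 46022 test instances.
print(round(41242 / 46022 * 100, 2))
```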
- The embodiment of the present application provides a scheduling running time comparison method for operators, which adopts the idea of cost comparison to determine the comparison result of the relative execution times of at least two schedules, and applies the cost comparison model to the optimization process of the operator automatic optimization system.
- The embodiment also involves a modeling method for a cost comparison model that can be applied to the operator automatic optimization system, including model architecture design, model training, and the model inference application process. In model training and model inference, a schedule can be converted into a dedicated data structure through feature extraction; together with the normalization of the data and the expression of the output format, this gives high accuracy, fast inference speed, and a lower training cost than existing methods.
- On the one hand, the higher accuracy of the cost comparison model is guaranteed; on the other hand, the inference speed of the cost comparison model is improved, and it takes only 3 milliseconds to compare one set of instances; furthermore, training the cost comparison model requires relatively little data and computing power: 30 epochs of training on more than 49 million training instances are completed in 70 hours on a single GPU card. With the cost comparison model, automatic tuning in the code optimizer/compiler only needs to consider how to improve the accuracy of the cost comparison model.
- By contrast, a cost model in the related art that predicts the absolute execution time of a schedule must consider, in addition to prediction accuracy, how to deal with the boundary problems caused by errors. For example, if the difference between the predicted running times of two schedules is smaller than the model's prediction error, the absolute-value model cannot give a high-confidence prediction.
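The boundary problem of an absolute-time model can be made concrete with a small sketch. The function name, the error margin, and the predicted values below are hypothetical and serve only to illustrate the argument in the text.

```python
def compare_by_absolute_prediction(pred_a, pred_b, model_error):
    """Return 'A' or 'B' for the faster schedule, or None when undecidable."""
    if abs(pred_a - pred_b) < model_error:
        # The gap is inside the error margin, so the ordering of the two
        # schedules cannot be stated with high confidence.
        return None
    return "A" if pred_a < pred_b else "B"


# Hypothetical predicted running times (seconds) and prediction error.
print(compare_by_absolute_prediction(15.2, 15.5, model_error=1.0))
print(compare_by_absolute_prediction(15.0, 17.0, model_error=1.0))
```

A model that outputs the ordering directly does not need such a margin check, which is the contrast the text draws.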
- FIG. 13 shows a block diagram of an apparatus for comparing scheduled runtimes of operators provided by an exemplary embodiment of the present application.
- the apparatus can be implemented as all or a part of the computer equipment provided in FIG. 3 through software, hardware or a combination of the two.
- The apparatus may include: a first acquisition unit 1310, a second acquisition unit 1320, and a calling unit 1330.
- The first acquisition unit 1310 is configured to acquire at least two candidate schedules corresponding to a target calculation expression, where the target calculation expression is used to describe the calculation logic of the operator, and each candidate schedule is executable code of the operator generated based on the target calculation expression;
- the second acquisition unit 1320 is configured to acquire a cost comparison model, where the cost comparison model is a model obtained by training a neural network using multiple sample schedules;
- the calling unit 1330 is configured to invoke the cost comparison model according to the at least two candidate schedules to output a cost comparison result, where the cost comparison result is used to indicate the ordering of the execution durations of the at least two candidate schedules.
- the calling unit 1330 is also used to:
- the cost comparison model is trained according to at least one set of sample data sets, and each set of sample data sets includes: at least two sample schedules corresponding to sample calculation expressions and pre-marked correct cost comparison results.
- the calling unit 1330 is also used to:
- the feature matrix corresponding to the candidate schedule is normalized to obtain the preprocessed candidate schedule.
- The feature matrix is used to indicate at least one of loop information, input data shape information, calculation code, axis type code, and data access type code. The loop information includes information related to the loop calculation logic of the candidate schedule; the input data shape information is used to describe the input data of the operator; the calculation code includes the codes of the calculation instructions used in the current loop of the candidate schedule; the axis type code includes the type codes for operations on an axis; and the data access type code includes the type codes for accessing data.
- the device further includes a training unit; the training unit is used for:
- training sample set includes at least one set of sample data sets
- At least two sample schedules are preprocessed to obtain at least two sample schedules after preprocessing;
- the original parameter model is a neural network model
- an error backpropagation algorithm is used to train the cost comparison model.
- the device further includes an update unit; the update unit is used for:
- the cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model.
- the division of the above-mentioned functional modules is used as an example for illustration. In practical applications, the above-mentioned function allocation can be completed by different functional modules according to the needs.
- the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
- the device and the method embodiment provided by the above embodiment belong to the same idea, and the specific implementation process thereof is detailed in the method embodiment, and will not be repeated here.
- An embodiment of the present application provides an operator scheduling runtime comparison device, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to implement, when executing the instructions, the methods executed by the computer device in the above-mentioned embodiments.
- An embodiment of the present application provides a computer program product, including computer-readable codes, or a non-volatile computer-readable storage medium carrying computer-readable codes.
- the processor executes the method performed by the computer device in the foregoing embodiments.
- An embodiment of the present application provides a non-volatile computer-readable storage medium, on which computer program instructions are stored.
- When the computer program instructions are executed by a processor, the methods performed by the computer device in the foregoing embodiments are implemented.
- a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
- a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- A non-exhaustive list of computer-readable storage media includes: portable computer disk, hard disk, random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), erasable programmable read-only memory (Electrically Programmable Read-Only-Memory, EPROM, or flash memory), static random-access memory (Static Random-Access Memory, SRAM), portable compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), digital video disc (Digital Video Disc, DVD), memory stick, floppy disk, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
- Computer readable program instructions or codes described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, local area network, wide area network, and/or wireless network.
- the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in that computing/processing device.
- Computer program instructions for performing the operations of the present application may be assembly instructions, instruction set architecture (Instruction Set Architecture, ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the “C” language or similar programming languages.
- Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
- The remote computer can be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).
- Electronic circuits, such as programmable logic circuits, field-programmable gate arrays (Field-Programmable Gate Array, FPGA), or programmable logic arrays (Programmable Logic Array, PLA), can execute computer-readable program instructions, thereby realizing various aspects of the present application.
- These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for realizing the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
- These computer-readable program instructions can also be stored in a computer-readable storage medium; the instructions cause computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
- Each block in a flowchart or block diagram may represent a module, a program segment, or a portion of an instruction, which contains one or more executable instructions.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts can be implemented with hardware (such as circuits or ASIC (Application Specific Integrated Circuit, application-specific integrated circuit)), or it can be realized by a combination of hardware and software, such as firmware.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present application relates to the field of data processing, and in particular to an operator scheduling operation time comparison method and device, and a storage medium. The method comprises: acquiring at least two candidate schedulings corresponding to a target computation expression, wherein the target computation expression is used for describing a computation logic of an operator; acquiring a cost comparison model, wherein the cost comparison model is a model obtained by training a neural network by using a plurality of sample schedulings; and according to the at least two candidate schedulings, invoking the cost comparison model to output a cost comparison result, wherein the cost comparison result is used for indicating a sorted order of magnitudes of execution durations of the at least two candidate schedulings on a target hardware platform. According to the present application, the relative magnitudes of the execution durations of different schedulings are directly compared without predicting absolute execution durations of the schedulings, thereby achieving an automatic tuning function of a compiler/automatic tuner, and greatly improving the speed and accuracy of the evaluation of scheduling operation cost.
Description
The present application relates to the field of data processing, and in particular to a scheduling running time comparison method and device for an operator, and a storage medium.
An operator indicates a data processing operation. For example, a neural network usually includes convolution operators and pooling operators; a convolution operator indicates a convolution operation, and a pooling operator indicates a pooling operation. To run an operator on an actual hardware platform and perform the corresponding data processing operation, executable code of the operator must be generated. The generation of an operator's executable code is divided into two steps: calculation expression and scheduling. Calculation expression means describing the calculation logic of the operator in a specific language, that is, describing the task the operator needs to complete and the operator's inputs and outputs; the language describing the calculation logic is then converted into an intermediate language, which yields the operator's intermediate representation information (also called a template). Scheduling means optimizing the operator's intermediate representation information according to the hardware characteristics of the target hardware platform. The schedule-optimized intermediate representation information can then be converted into executable code recognizable by the target hardware platform.
Automatic operator optimization is an important function of optimization tools and compilers. Its difficulty lies in searching, within the schedule space formed by a massive number of schedules, for the optimal schedule implementation for a specific hardware platform. Evaluating the execution duration on the hardware platform of different schedules of the operators in a neural network is critical to successful optimization. To evaluate the execution duration of a schedule on a specific hardware platform, the related art can use a pre-trained cost model to evaluate the absolute execution duration of the schedule, thereby evaluating the running cost of the schedule. In this approach, however, the error between the predicted absolute execution duration and the real execution duration is relatively large, and professionals must build a dedicated cost model for the specific hardware platform, which often requires massive training data and a complex model structure. In addition, because the prediction error is relatively large, this approach cannot eliminate the uncertainty in comparing the costs of schedules with similar predicted values.
The related art has not yet provided a reasonable and effective method for evaluating the running cost of a schedule.
Summary of the invention
In view of this, a scheduling running time comparison method and device for an operator, and a storage medium, are proposed. Without predicting the absolute execution duration of a schedule, the embodiments of the present application directly compare the relative magnitudes of the execution durations of different schedules, thereby implementing the automatic tuning function of a compiler/automatic optimizer and greatly improving the speed and accuracy of evaluating the running cost of a schedule.
In a first aspect, an embodiment of the present application provides a scheduling running time comparison method for an operator, the method including:
acquiring at least two candidate schedules corresponding to a target calculation expression, where the target calculation expression is used to describe the calculation logic of an operator, and each candidate schedule is executable code of the operator on a target hardware platform generated based on the target calculation expression;
acquiring a cost comparison model, where the cost comparison model is a model obtained by training a neural network using multiple sample schedules;
invoking the cost comparison model according to the at least two candidate schedules to output a cost comparison result, where the cost comparison result is used to indicate the ordering by magnitude of the execution durations of the at least two candidate schedules on the target hardware platform.
In this implementation, at least two candidate schedules corresponding to the target calculation expression are acquired, and the cost comparison model is invoked according to them to directly compare the relative magnitudes of their execution durations on the target hardware platform and output a cost comparison result indicating the ordering of the execution durations. This implements the automatic tuning function of the compiler/automatic optimizer and greatly improves the speed and accuracy of evaluating the running cost of a schedule.
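One way to turn pairwise comparisons into an ordering of candidates is to use the comparison as a sort comparator. The sketch below assumes a hypothetical `compare(a, b)` callable standing in for the cost comparison model; the toy comparator and its hidden execution times are invented for the example and are not part of the application.

```python
from functools import cmp_to_key


def rank_candidates(candidates, compare):
    """Order candidates from shortest to longest execution duration.

    `compare(a, b)` stands in for the cost comparison model: negative if a
    is judged faster than b, zero if equal, positive if a is judged slower.
    """
    return sorted(candidates, key=cmp_to_key(compare))


# Stand-in for a trained model: looks up made-up execution times (seconds).
hidden_times = {"s1": 17.0, "s2": 15.0, "s3": 16.0}


def toy_compare(a, b):
    return (hidden_times[a] > hidden_times[b]) - (hidden_times[a] < hidden_times[b])


ranking = rank_candidates(["s1", "s2", "s3"], toy_compare)
print(ranking)
```

A learned comparator is not guaranteed to be transitive, so a real tuner might prefer a tournament or vote-counting scheme over a plain sort; the sort is used here only because it is the shortest correct illustration.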
In a possible implementation, the invoking the cost comparison model according to the at least two candidate schedules to output a cost comparison result includes:
preprocessing the at least two candidate schedules to obtain the preprocessed at least two candidate schedules;
inputting the preprocessed at least two candidate schedules into the cost comparison model to output the cost comparison result;
where the cost comparison model is trained according to at least one sample data group, and each sample data group includes: at least two sample schedules corresponding to a sample calculation expression and a pre-labeled correct cost comparison result.
In this implementation, the at least two candidate schedules are preprocessed, and the preprocessed candidate schedules are input into the cost comparison model to output the cost comparison result. Because the cost comparison model is trained according to at least one sample data group, the high accuracy of the cost comparison model is guaranteed, which in turn guarantees the accuracy of the cost comparison result output by the model.
In another possible implementation, the preprocessing the at least two candidate schedules to obtain the preprocessed at least two candidate schedules includes:
for each candidate schedule of the at least two candidate schedules, performing feature extraction on the candidate schedule to obtain a feature matrix;
normalizing the feature matrix corresponding to the candidate schedule to obtain the preprocessed candidate schedule.
In this implementation, for each of the at least two candidate schedules, feature extraction is performed on the candidate schedule to obtain a feature matrix, and the feature matrix is normalized to obtain the preprocessed candidate schedule. Converting the candidate schedule into a dedicated data structure through preprocessing further guarantees the accuracy of the cost comparison result subsequently output by the model.
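The text does not say which normalization is applied to the feature matrix. As a minimal sketch, assuming per-column min-max scaling to [0, 1] and a made-up three-column feature matrix, the normalization step could look like this:

```python
def normalize_columns(matrix):
    """Min-max scale each column of a feature matrix to the range [0, 1]."""
    scaled_columns = []
    for column in zip(*matrix):
        low, high = min(column), max(column)
        span = high - low
        # A constant column carries no information; map it to zeros.
        scaled_columns.append(
            [0.0 if span == 0 else (value - low) / span for value in column]
        )
    # Transpose back so that each row is one feature vector again.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical feature matrix: one row per loop level, arbitrary raw values.
features = [[2, 64, 1], [8, 64, 0], [4, 64, 1]]
normalized = normalize_columns(features)
print(normalized)
```

Scaling keeps features with large raw ranges, such as loop extents, from dominating small categorical codes when the matrix is fed to the network.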
In another possible implementation, the feature matrix is used to indicate at least one of loop information, input data shape information, calculation code, axis type code, and data access type code. The loop information includes information related to the loop calculation logic of the candidate schedule; the input data shape information is used to describe the input data of the operator; the calculation code includes the codes of the calculation instructions used in the current loop of the candidate schedule; the axis type code includes the type codes for operations on an axis; and the data access type code includes the type codes for accessing data.
In this implementation, the candidate schedule is converted into a feature matrix that may include at least one of five types of information: loop information, input data shape information, calculation code, axis type code, and data access type code. The feature matrix with this data structure is used as the input data of the cost comparison model, which further improves the accuracy of the cost comparison result subsequently output by the model.
In another possible implementation, before the acquiring a cost comparison model, the method further includes:
acquiring a training sample set, where the training sample set includes at least one sample data group;
for each sample data group, preprocessing the at least two sample schedules to obtain the preprocessed at least two sample schedules;
inputting the preprocessed at least two sample schedules into an original parameter model to obtain a training result, where the original parameter model is a neural network model;
comparing the training result with the correct cost comparison result to obtain a calculation loss, where the calculation loss is used to indicate the error between the training result and the correct cost comparison result;
training with an error back-propagation algorithm according to the calculation losses corresponding to the at least one sample data group to obtain the cost comparison model.
In this implementation, a training sample set including at least one sample data group is acquired before the cost comparison model is acquired. For each sample data group, the at least two sample schedules are preprocessed; the preprocessed sample schedules are input into the original parameter model, which is a neural network model, to obtain a training result; the training result is compared with the correct cost comparison result to obtain a calculation loss indicating the error between them; and the cost comparison model is obtained by training with the error back-propagation algorithm according to the calculation losses of the sample data groups. A cost comparison model for evaluating the scheduling running cost of operators is thus trained in advance, which ensures that the model can subsequently be invoked to implement the scheduling running time comparison method.
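The loss step can be illustrated with a short sketch. The text does not name the loss function; the example assumes softmax outputs with a cross-entropy loss against the one-hot target code, a common choice for a three-way classification, and shows the gradient with respect to the raw outputs, which is where error back-propagation would start.

```python
import math


def softmax(logits):
    """Convert raw network outputs into probabilities."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_gradient(logits, target_code):
    """Cross-entropy loss for a one-hot target code, plus d(loss)/d(logits)."""
    target = [float(bit) for bit in target_code]
    probs = softmax(logits)
    loss = -sum(t * math.log(p) for t, p in zip(target, probs))
    # For softmax with cross-entropy the gradient is simply probs - target.
    gradient = [p - t for p, t in zip(probs, target)]
    return loss, gradient


# An untrained network that scores all three outcomes equally.
loss, grad = loss_and_gradient([0.0, 0.0, 0.0], "001")
print(loss, grad)
```

In a full training loop this gradient would be propagated back through the network layers and combined with the momentum update mentioned in the experimental description.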
In another possible implementation, after the invoking the cost comparison model according to the at least two candidate schedules to output a cost comparison result, the method further includes:
adding the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set;
training the cost comparison model according to the updated training sample set to obtain an updated cost comparison model.
In this implementation, the at least two candidate schedules and the cost comparison result are added to the training sample set to obtain an updated training sample set, and the cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model. The cost comparison model is thus updated in a timely manner, and its accuracy is continuously improved.
In a second aspect, an embodiment of the present application provides a scheduling running time comparison device for an operator, the device including:
a first acquisition unit, configured to acquire at least two candidate schedules corresponding to a target calculation expression, where the target calculation expression is used to describe the calculation logic of an operator, and each candidate schedule is executable code of the operator generated based on the target calculation expression;
a second acquisition unit, configured to acquire a cost comparison model, where the cost comparison model is a model obtained by training a neural network using multiple sample schedules;
a calling unit, configured to invoke the cost comparison model according to the at least two candidate schedules to output a cost comparison result, where the cost comparison result is used to indicate the ordering by magnitude of the execution durations of the at least two candidate schedules.
In a possible implementation, the calling unit is further configured to:
preprocess the at least two candidate schedules to obtain the preprocessed at least two candidate schedules;
input the preprocessed at least two candidate schedules into the cost comparison model to output the cost comparison result;
where the cost comparison model is trained according to at least one sample data group, and each sample data group includes: at least two sample schedules corresponding to a sample calculation expression and a pre-labeled correct cost comparison result.
In another possible implementation, the calling unit is further configured to:
for each candidate schedule of the at least two candidate schedules, perform feature extraction on the candidate schedule to obtain a feature matrix;
normalize the feature matrix corresponding to the candidate schedule to obtain the preprocessed candidate schedule.
In another possible implementation, the feature matrix is used to indicate at least one of loop information, input data shape information, calculation code, axis type code, and data access type code. The loop information includes information related to the loop calculation logic of the candidate schedule; the input data shape information is used to describe the input data of the operator; the calculation code includes the codes of the calculation instructions used in the current loop of the candidate schedule; the axis type code includes the type codes for operations on an axis; and the data access type code includes the type codes for accessing data.
In another possible implementation, the apparatus further includes a training unit, and the training unit is configured to:
obtain a training sample set, the training sample set including at least one sample data group;
for each sample data group, preprocess the at least two sample schedules to obtain at least two preprocessed sample schedules;
input the at least two preprocessed sample schedules into an original parameter model to obtain a training result, the original parameter model being a neural network model;
compare the training result with the correct cost comparison result to obtain a calculation loss, the calculation loss indicating the error between the training result and the correct cost comparison result;
train the original parameter model with an error back-propagation algorithm according to the calculation loss corresponding to each of the at least one sample data group, to obtain the cost comparison model.
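The training steps above can be sketched in miniature as follows. This is not the network of the application: a single-layer logistic comparator with shared weights stands in for the neural network model, cross-entropy stands in for the calculation loss, and the data are synthetic. The names `predict`, `train`, and `synthetic_groups`, and the convention that label 1 means the first schedule is faster, are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, clamped to avoid overflow in math.exp."""
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, feat_a, feat_b):
    """Probability that schedule A is faster than schedule B."""
    z = sum(w * (a - b) for w, a, b in zip(weights, feat_a, feat_b))
    return sigmoid(z)


def train(sample_groups, dim, epochs=200, lr=0.1, seed=0):
    """Fit weights on (feat_a, feat_b, label) groups; label 1 means A is faster."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
    for _ in range(epochs):
        for feat_a, feat_b, label in sample_groups:
            # Forward pass: the training result for this sample data group.
            p = predict(weights, feat_a, feat_b)
            # For cross-entropy loss, d(loss)/dz = p - label; propagate it
            # back to every weight through z = w . (feat_a - feat_b).
            error = p - label
            for i in range(dim):
                weights[i] -= lr * error * (feat_a[i] - feat_b[i])
    return weights


def synthetic_groups(count, seed=1):
    """Toy data: the hidden 'true' cost of a schedule is 2*x0 + x1."""
    rng = random.Random(seed)
    groups = []
    for _ in range(count):
        a = [rng.random(), rng.random()]
        b = [rng.random(), rng.random()]
        label = 1 if 2 * a[0] + a[1] < 2 * b[0] + b[1] else 0
        groups.append((a, b, label))
    return groups


groups = synthetic_groups(50)
weights = train(groups, dim=2)
# Greater than 0.5: the first schedule is predicted to be the faster one.
print(predict(weights, [0.1, 0.1], [0.9, 0.9]))
```

Because both schedules are scored with the same weights, swapping the two inputs yields the complementary probability, so the comparator cannot contradict itself on a single pair.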
In another possible implementation, the apparatus further includes an update unit, and the update unit is configured to:
add the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set;
train the cost comparison model on the updated training sample set to obtain an updated cost comparison model.
In a third aspect, an embodiment of the present application provides an apparatus for comparing the schedule running times of an operator, the apparatus including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to implement the above method when executing the instructions.
In a fourth aspect, an embodiment of the present application provides a non-volatile computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the above method.
In a fifth aspect, an embodiment of the present application provides a computer program product that, when run on a computer, causes the computer to perform the above method.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the present application and, together with the specification, serve to explain the principles of the present application.
FIG. 1 is a schematic diagram of the generation of a schedule space in the related art.
FIG. 2 is a schematic diagram of the principles of the actual measurement method and the cost model method in the related art.
FIG. 3 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
FIG. 4 is a flowchart of a method for comparing the schedule running times of an operator according to an exemplary embodiment of the present application.
FIG. 5 is a schematic diagram of the principle of a method for comparing the schedule running times of an operator according to an exemplary embodiment of the present application.
FIG. 6 is a flowchart of the training process of a cost comparison model according to an exemplary embodiment of the present application.
FIG. 7 is a schematic diagram of the principle of the training process of a cost comparison model according to an exemplary embodiment of the present application.
FIG. 8 is a schematic diagram of the input-output curve of a normalization function according to an exemplary embodiment of the present application.
FIG. 9 is a schematic diagram of the network structure of a multilayer perceptron architecture according to an exemplary embodiment of the present application.
FIG. 10 is a flowchart of a method for comparing the schedule running times of an operator according to another exemplary embodiment of the present application.
FIG. 11 is a schematic diagram of the data structure of a feature matrix according to an exemplary embodiment of the present application.
FIG. 12 is a schematic diagram of the principle of the application process of a cost comparison model according to another exemplary embodiment of the present application.
FIG. 13 is a block diagram of an apparatus for comparing the schedule running times of an operator according to an exemplary embodiment of the present application.
Various exemplary embodiments, features, and aspects of the present application are described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred over or better than other embodiments.
In addition, numerous specific details are given in the detailed description below in order to better explain the present application. Those skilled in the art will understand that the present application can be practiced without some of these details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present application.
With the rapid development of artificial intelligence technology, deep learning has been widely applied in many fields, and the demand of these applications for computing resources is growing rapidly, so optimizing deep learning algorithms is increasingly important. Deep learning builds a deep learning model and iteratively fits it to a large amount of historical data (model training), so that the model establishes a mapping between inputs and outputs and can then predict results for new input data (model inference). A deep learning model contains a large number of operators, such as convolution operators, fully connected operators, and pooling operators. Different operators, stacked and connected, together form the deep learning model, also called a neural network model. The topology of the neural network is called the neural network architecture, and the parameters of the operators in the neural network are the model parameters. For an operator to execute efficiently on a specific hardware platform, its calculation expression must be deeply optimized. The specific hardware platform may be a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processing unit (NPU).
The calculation expression of an operator can be implemented in many ways, each of which is called a schedule, and the performance of different schedules on a specific hardware platform can differ greatly. A compiler or automatic optimizer can optimize a deep learning application by automatically searching the large number of schedule implementations for the optimal schedule for specific hardware, thereby reducing the computing power required and increasing system throughput. In engineering practice, an intermediate representation called a template may exist between the calculation expression and the schedule. A calculation expression can form multiple templates, and each template can in turn generate multiple schedules.
Automatic operator optimization is an important function of optimization tools and compilers. The performance of the automatically optimized operators determines whether a deep learning model can be deployed efficiently and meet product requirements. The difficulty of automatic operator optimization lies in searching the schedule space formed by a massive number of schedules for the optimal schedule implementation for a specific hardware platform. Evaluating the execution duration of different schedules of the operators in a neural network on the hardware platform is therefore critical to successful optimization. In this document, "cost" refers to the execution duration of a schedule on the hardware platform. There are currently two main methods for evaluating this execution duration: the actual measurement method and the cost model method.
In the actual measurement method, each schedule goes through code generation and compilation and is then run on the hardware, and the execution duration is obtained by measuring the running time. This method requires the complete compilation flow. Its drawback is that evaluating a schedule takes a long time (a second or more), which is too slow for real schedule spaces containing hundreds of thousands or millions of schedules; limited by search time, it is difficult to explore a larger schedule space.
In the cost model method, a cost model is built to estimate the execution duration of a schedule. Because this method requires no compilation, running, or measurement, it has a clear advantage in evaluation time.
In the related art, cost-model-based methods all evaluate the running cost of a schedule by predicting its absolute execution duration. In this approach, however, the error between the predicted and the real absolute execution duration is relatively large, and specialists must build a dedicated cost model for each specific hardware platform, which often requires massive training data and a complex model structure. In addition, because the prediction error is relatively large, the uncertainty in comparing the costs of schedules with similar predicted values cannot be eliminated. These drawbacks limit the use of the related-art cost model method in practical optimization.
The embodiments of the present application provide a method and an apparatus for comparing the schedule running times of an operator, and a storage medium. Without predicting the absolute execution duration of a schedule, the method directly compares the relative execution durations of different schedules, thereby supporting the automatic tuning function of a compiler or automatic optimizer and greatly improving the speed and accuracy with which schedule running cost is evaluated. Compared with the related-art methods, the method provided by the embodiments of the present application has clear advantages in both speed and accuracy, improving the performance of the operator optimizer and significantly reducing evaluation time.
First, some terms used in the present application are introduced.
1. Calculation expression (compute): the whole formed by the input data, output data, and calculation logic of an operator. A calculation expression is an instance that describes a specific calculation process. In an automatic operator optimization framework, the calculation expression can be user-defined and contains all the information needed to carry out the calculation logic the user requires. It usually takes the form of pseudocode or a structured flowchart, which is easy to write but unoptimized.
2. Template: a calculation expression can be turned into a template through a series of equivalent transformations. A template is the intermediate representation that lies between the calculation expression and the schedule during structural optimization of the calculation expression. Generally, a template fixes the order in which calculations are executed and the data access pattern in the logic of the calculation expression.
A template changes the execution order and data access pattern of the calculation expression, but does not restrict how the input data of the calculation expression is concretely partitioned. For example, after a loop undergoes an axis-split transformation, a single loop can be split into several sub-loops, and splits into different numbers of sub-loops are different templates. In each template, the loop bounds of the sub-loops only need to be equivalent to the calculation expression; the numerical values of the bounds of each sub-loop are not determined.
3. Schedule: scheduling optimization is performed on the intermediate representation of the operator according to the hardware characteristics of the target hardware platform. A schedule fixes the concrete value of every variable parameter in the template and can be converted into a software-implementable description of the calculation expression. For the same input data, the output data of the schedule is identical to that of the calculation expression, but the calculation process may differ.
4. Feature embedding: the intermediate output produced when input data passes through a neural network module. A feature embedding is the neural network module's mapping of the input data into another space, including extraction, enhancement, and encoding of the input data.
5. Multilayer perceptron: a basic neural network unit composed of stacked fully connected layers, activation layers, and the like. A multilayer perceptron can form the entire neural network architecture or appear as a module within part of an overall architecture.
In an illustrative example that takes the transformation of a single loop (in pseudocode), the generation of the schedule space is shown in FIG. 1. A computer device (for example, an automatic optimizer) obtains the calculation expression input by the user and transforms it to generate a template space; the templates in the template space can be converted into schedule implementations that are logically equivalent to the calculation expression. The set of legal schedules forms the schedule space. The computer device searches the schedule space and outputs the optimal schedule implementation. In this example, the calculation expression is a user-defined loop calculation, the calculation logic in the loop body is represented by a statement (stmt), and the loop bounds run from 0 to 546756. Through axis-split transformations, the single loop of the calculation expression can be equivalently converted into templates with doubly nested loops, triply nested loops, and so on up to N-fold nested loops. In a template, the bounds of each nested loop are not yet determined, and axis splitting allows different plans for the data access pattern. Filling the template with loop bound values equivalent to the loop bounds of the calculation expression, and reasonably transforming or constraining the loop statement (for example, stmt_tpln_immd_constrain in FIG. 1 may be the intermediate constraint statement of the n-th template), yields code that is logically equivalent to the calculation expression; this code is a schedule.
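The relationship between a calculation expression, a two-level template, and the schedules instantiated from that template can be sketched as follows. The sketch restricts splits to exact divisors of the loop extent, which is a simplification assumed here; the function names are hypothetical and do not come from the application.

```python
def compute(n, stmt):
    """Calculation expression: a single loop over [0, n)."""
    for i in range(n):
        stmt(i)


def split_factors(n):
    """All (outer, inner) loop bounds whose product equals n."""
    return [(outer, n // outer) for outer in range(1, n + 1) if n % outer == 0]


def schedule_two_level(outer, inner, stmt):
    """One schedule of the two-level template, with concrete loop bounds."""
    for io in range(outer):
        for ii in range(inner):
            stmt(io * inner + ii)


n = 12
reference = []
compute(n, reference.append)

# Every instantiation of the template visits the same indices in the same
# order, so each one is logically equivalent to the calculation expression.
for outer, inner in split_factors(n):
    visited = []
    schedule_two_level(outer, inner, visited.append)
    assert visited == reference

print(len(split_factors(n)), "schedules from the two-level template")
```

With the extent 546756 from the example above, the same enumeration applies, and adding three-level and deeper templates multiplies the number of schedules further, which is why the schedule space grows so quickly.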
In real scenarios, a calculation expression built from complex calculation logic can usually give rise to millions or tens of millions of schedule implementations, and the execution durations of different schedules on the target hardware platform can differ by a factor of hundreds or thousands. Through a series of operations, an automatic optimization system searches the space formed by this massive number of schedules for the schedule that executes best on the hardware, thereby optimizing the operator.
As shown in FIG. 2, the actual measurement method generates legal code according to the definition of the schedule, compiles it with a compiler, and executes and measures it on the hardware to obtain a performance evaluation result. The result is usually the execution duration of the schedule, but may also be the number of hardware clock cycles needed to run it. After a limited number of schedule implementations have been measured, the schedule with the shortest execution duration (or the smallest hardware clock cycle count) is selected. The execution duration obtained this way is the most accurate and realistic. The drawback is that code generation and compilation usually take several seconds to several minutes, and the time spent running and measuring depends on the amount and complexity of the operator's computation, so large-scale selection is very slow.
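A minimal sketch of the measurement loop is given below, assuming that each candidate schedule is already available as a Python callable. The code generation and compilation stages, which dominate the cost in practice, are omitted, and `measure` and `pick_fastest` are hypothetical helper names.

```python
import time


def measure(schedule, repeats=5):
    """Best-of-N wall-clock time of one schedule, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        schedule()
        best = min(best, time.perf_counter() - start)
    return best


def pick_fastest(schedules):
    """Measure every candidate and return the fastest name plus all timings."""
    timings = {name: measure(fn) for name, fn in schedules.items()}
    return min(timings, key=timings.get), timings


data = list(range(10_000))
candidates = {
    # Two equivalent ways of summing the same data.
    "builtin_sum": lambda: sum(data),
    "manual_loop": lambda: [0 for _ in [0]] and sum(x for x in data),
}
fastest, timings = pick_fastest(candidates)
print(fastest, timings)
```

Every candidate must actually be executed, so the total evaluation time grows linearly with the size of the schedule space; this is the scaling problem that the cost model method tries to avoid.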
Search and selection methods based on machine learning models can greatly accelerate this process, reducing code generation, compilation, and running that take seconds to a neural network inference that takes milliseconds. At the same time, because model prediction accuracy is limited, the quality of the selection may decline. As described above, current cost models extract features from a schedule and then predict its absolute execution duration or number of running cycles. The accuracy of such a cost model is low: related-art cost models have an average error of 16% when predicting schedule execution duration, while the real execution durations of a considerable number of schedules differ from one another by less than 16%. Besides the error, obtaining the cost model is expensive: training requires 1.8 million training samples, the network architecture is complex, and training takes a long time to converge. This approach therefore cannot achieve fast and accurate operator search optimization.
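The effect of the prediction error on comparisons can be illustrated numerically. The sketch below treats the 16% average error as a symmetric band around each prediction, which is a simplifying assumption made only for this illustration, and asks whether the bands of two predictions are disjoint.

```python
def distinguishable(pred_a, pred_b, rel_error=0.16):
    """True if the error bands of the two predictions do not overlap."""
    low_a, high_a = pred_a * (1 - rel_error), pred_a * (1 + rel_error)
    low_b, high_b = pred_b * (1 - rel_error), pred_b * (1 + rel_error)
    return high_a < low_b or high_b < low_a


# Predicted durations in microseconds (made-up values).
print(distinguishable(100.0, 110.0))  # False: a 10% gap is inside the error band
print(distinguishable(100.0, 150.0))  # True: a 50% gap exceeds the error band
```

When two schedules are this close, ranking them by predicted absolute duration is unreliable, which motivates comparing the schedules directly instead.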
The embodiments of the present application provide a new cost model: the cost comparison model. The cost comparison model avoids predicting schedule execution duration directly and turns the regression problem into a classification problem that a neural network learns more easily. The cost comparison model takes at least two candidate schedules as input and outputs a cost comparison result, which indicates the ordering of the execution durations of the at least two candidate schedules on the target hardware platform. The method provided by the embodiments of the present application has the advantages of high accuracy, fast inference, and a lower training cost than existing methods. In the operator optimization flow, the cost comparison model can quickly compare the execution durations of different schedules, enabling large-scale search optimization of operators.
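One way to picture the shift from regression to classification is the construction of labeled pairs from measured durations, sketched below. The helper name `make_sample_groups` and the convention that label 1 means the first schedule of the pair is faster are assumptions for illustration, not the labeling scheme of the application.

```python
from itertools import combinations


def make_sample_groups(measured_times):
    """Turn per-schedule timings into pairwise classification samples.

    Label 1 means the first schedule of the pair is strictly faster;
    label 0 means it is not.
    """
    groups = []
    for (name_a, t_a), (name_b, t_b) in combinations(measured_times.items(), 2):
        label = 1 if t_a < t_b else 0
        groups.append((name_a, name_b, label))
    return groups


# Made-up execution durations in milliseconds.
measured = {"s0": 3.0, "s1": 1.5, "s2": 2.0}
for group in make_sample_groups(measured):
    print(group)
```

The absolute durations are discarded once the labels are built, so the model only has to learn which schedule of a pair is faster, not by how much.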
It should be noted that the method for comparing the schedule running times of an operator provided by the embodiments of the present application can be applied in the optimization process of an automatic operator optimization system. The core of the embodiments is the cost comparison model, including its model architecture, its training process, and its application process. The method can be applied on a specific computer device (for example, a CPU, GPU, or NPU) to compare and search a large number of candidate schedule implementations of a target calculation expression, so as to obtain the optimal schedule and optimize the target calculation expression for that computer device.
The method for comparing the schedule running times of an operator provided by the embodiments of the present application is executed by a computer device, which may be a general-purpose computer device or a special-purpose computing device. Refer to FIG. 3, which shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
The computer device may be a terminal or a server. Terminals include tablet computers, laptop computers, desktop computers, and the like. The server may be a single server, a server cluster composed of several servers, or a cloud computing service center.
As shown in FIG. 3, the computer device includes a processor 10, a memory 20, and a communication interface 30. Those skilled in the art will understand that the structure shown in FIG. 3 does not limit the computer device, which may include more or fewer components than shown, combine some components, or arrange the components differently. Specifically:
The processor 10 is the control center of the computer device. It connects the parts of the computer device through various interfaces and lines, and performs the functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 20 and calling the data stored in the memory 20, thereby controlling the computer device as a whole. The processor 10 may be implemented by a CPU or by a GPU.
The memory 20 can be used to store software programs and modules. The processor 10 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area. The program storage area may store an operating system 21, a first acquisition unit 22, a second acquisition unit 23, a calling unit 24, and an application program 25 required by at least one function (for example, neural network training); the data storage area may store data created through use of the computer device. The memory 20 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. Accordingly, the memory 20 may also include a memory controller to provide the processor 10 with access to the memory 20.
The processor 10 performs the following function by running the first acquisition unit 22: acquiring at least two candidate schedules corresponding to a target calculation expression, where the target calculation expression describes the calculation logic of an operator and each candidate schedule is executable code of the operator generated based on the target calculation expression. The processor 10 performs the following function by running the second acquisition unit 23: acquiring a cost comparison model, where the cost comparison model is obtained by training a neural network with multiple sample schedules. The processor 10 performs the following function by running the calling unit 24: calling the cost comparison model according to the at least two candidate schedules to output a cost comparison result, where the cost comparison result indicates the ordering of the execution durations of the at least two candidate schedules.
Optionally, the computer device obtains the calculation expression code input by the user, that is, the target calculation expression, analyzes it through the operator optimization system, generates a template space using a method such as optimization rules or a polyhedral model, and generates a large number of legal candidate schedules by instantiating the templates; the generated candidate schedules form the schedule space. Each instance in the schedule space represents one legal schedule. The cost comparison model provided by the embodiments of the present application serves as the evaluation module: it compares at least two input candidate schedules and outputs a cost comparison result, so that the optimal schedule can be found in the schedule space.
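As an illustrative sketch of how a pairwise comparator can drive the search for the optimal schedule, the function below keeps the winner of each comparison while scanning the schedule space once. The comparator `first_is_faster` stands in for a call to the cost comparison model; here it is backed by a made-up cost table, and the linear scan is an assumed search strategy rather than the one used by the application.

```python
def find_best(candidates, first_is_faster):
    """Linear tournament: keep the winner of each pairwise comparison."""
    if len(candidates) < 2:
        raise ValueError("at least two candidate schedules are required")
    best = candidates[0]
    for challenger in candidates[1:]:
        # Replace the current best whenever it is not judged faster.
        if not first_is_faster(best, challenger):
            best = challenger
    return best


# Stand-in for the cost comparison model: a hidden table of true costs.
hidden_cost = {"s0": 5.0, "s1": 2.0, "s2": 9.0, "s3": 2.5}


def first_is_faster(a, b):
    return hidden_cost[a] < hidden_cost[b]


print(find_best(list(hidden_cost), first_is_faster))
```

A scan of this kind needs one comparison per additional candidate, so the cost of the search is dominated by the inference time of the comparator rather than by compilation and measurement.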
The method for comparing the schedule running times of an operator is described below using illustrative embodiments.
Refer to FIG. 4, which shows a flowchart of a method for comparing the schedule running times of an operator according to an exemplary embodiment of the present application. This embodiment is described using the example in which the method is applied to the computer device shown in FIG. 3. The method includes:
步骤401,获取目标计算表达对应的至少两个候选调度,目标计算表达用于描述算子的计算逻辑,候选调度为基于目标计算表达生成的算子在目标硬件平台上的可执行代码。Step 401: Acquire at least two candidate schedules corresponding to the target computing expression. The target computing expression is used to describe the computing logic of the operator. The candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform.
可选地,计算机设备从目标计算表达对应的调度空间中获取至少两个候选调度。示意性的,计算机设备获取输入的目标计算表达,对目标计算表达进行分析,根据预设方式生成模板空间,通过对模板实例化生成多个候选调度,生成的多个候选调度构成调度空间。计算机设备从该调度空间中获取至少两个候选调度。Optionally, the computer device acquires at least two candidate schedules from the schedule space corresponding to the target computation expression. Schematically, the computer device obtains the input target calculation expression, analyzes the target calculation expression, generates a template space according to a preset method, generates multiple candidate schedules by instantiating the template, and the generated multiple candidate schedules constitute a scheduling space. The computer device obtains at least two candidate schedules from the schedule space.
Optionally, the preset manner is dynamic programming, an optimization-rule method, or a polyhedral-model method. The preset manner may differ between computing systems. The embodiments of this application do not limit the algorithm used to generate the schedule space. The schedule space must contain at least two candidate schedules to compare for the embodiments of this application to be applicable.
The target computation expression is a specific computation expression; for example, it is the computation expression that was input.
A candidate schedule is executable code of the operator on the target hardware platform, generated from the target computation expression. For example, the target hardware platform is a CPU, a GPU, or an NPU. The embodiments of this application do not limit this.
Optionally, the computer device obtains the at least two candidate schedules corresponding to the target computation expression when it receives a preset acquisition instruction. Alternatively, the computer device obtains them at preset time intervals, or obtains them in real time.
The preset time interval is either a default setting or a user-defined setting, which is not limited in this embodiment.
Step 402: Obtain a cost comparison model. The cost comparison model is a model obtained by training a neural network with multiple sample schedules.
The computer device obtains the trained cost comparison model. In one possible implementation, when the computer device is a terminal, the terminal obtains a trained cost comparison model that it stores itself, or obtains the trained cost comparison model from a server. In another possible implementation, when the computer device is a server, the server obtains a trained cost comparison model that it stores itself.
The cost comparison model is a model obtained by training a neural network with at least two sample schedules and a correct cost comparison result. In other words, the cost comparison model is determined from at least two sample schedules and the correct cost comparison result, where the correct cost comparison result is labeled in advance and corresponds to the at least two sample schedules.
The neural network of the cost comparison model may use an end-to-end stacked multilayer perceptron architecture. Other reasonable variants can also provide the fitting function of the cost comparison model, although the choice of architecture affects the final accuracy of the model. Any network architecture obtained by modifying this architecture, deriving from it, or replacing its layers shall be regarded as equivalent to the neural network described in the embodiments of this application.
For example, the neural network is a deep neural network (DNN), a convolutional neural network (CNN), or a recurrent neural network (RNN). The embodiments of this application do not limit this.
The cost comparison model is a neural network model capable of identifying the relative execution durations of at least two candidate schedules on the target hardware platform.
The cost comparison model converts at least two input candidate schedules into a cost comparison result. The cost comparison result indicates the ordering of the execution durations of the at least two candidate schedules on the target hardware platform.
The cost comparison model represents the correlation between the at least two candidate schedules and the cost comparison result.
The cost comparison model is a preset mathematical model that includes model coefficients relating the at least two candidate schedules to the cost comparison result. A model coefficient may be a fixed value, a value that is modified dynamically over time, or a value that is modified dynamically according to the usage scenario.
Step 403: Based on the at least two candidate schedules, invoke the cost comparison model to output a cost comparison result. The cost comparison result indicates the ordering of the execution durations of the at least two candidate schedules on the target hardware platform.
Optionally, the computer device preprocesses the at least two candidate schedules to obtain at least two preprocessed candidate schedules, inputs the preprocessed candidate schedules into the cost comparison model, and obtains the cost comparison result as output.
The cost comparison result indicates the ordering of the execution durations of the at least two candidate schedules on the target hardware platform. That is, it does not indicate the absolute execution durations of the candidate schedules on the target hardware platform, only their relative magnitudes.
Optionally, the cost comparison result is encoded information that represents the predicted comparison of the execution durations of the at least two candidate schedules. The computer device decodes the encoded information output by the cost comparison model to obtain the ordering of the execution durations of the at least two candidate schedules, that is, the comparison result.
For example, the cost comparison result includes encoded information whose values correspond one-to-one to the possible outcomes of comparing the execution durations of the at least two candidate schedules. Taking a first candidate schedule and a second candidate schedule as the at least two candidate schedules: when the encoded information is a first value, it indicates that the execution duration of the first candidate schedule is shorter than that of the second candidate schedule; when it is a second value, it indicates that the two execution durations are equal; and when it is a third value, it indicates that the execution duration of the first candidate schedule is longer than that of the second candidate schedule. The first, second, and third values are all different from one another.
Optionally, based on the cost comparison result of the at least two candidate schedules, the computer device takes the candidate schedule with the shortest execution duration as the target schedule, retains the target schedule, and discards the other candidate schedules.
Optionally, when the cost comparison result indicates that the execution durations of the at least two candidate schedules are all the same, the computer device takes any one of the candidate schedules as the target schedule, retains it, and discards the other candidate schedules. The embodiments of this application do not limit how schedules are retained and discarded.
In an illustrative example, as shown in FIG. 5, the computer device obtains the input target computation expression, analyzes it, generates a template space in a preset manner, and instantiates the templates to generate multiple candidate schedules, which form the schedule space. Two candidate schedules, for example schedule A and schedule B, are obtained from the schedule space. Schedule A and schedule B are preprocessed, the preprocessed schedules are input into the cost comparison model to obtain encoded information, and the encoded information is decoded to obtain the cost comparison result of schedule A and schedule B. For example, the encoded information 001 indicates that the execution duration of schedule A is shorter than that of schedule B, so schedule A is retained and schedule B is discarded. The encoded information 010 indicates that the execution duration of schedule A equals that of schedule B, so either schedule A or schedule B is retained. The encoded information 100 indicates that the execution duration of schedule A is longer than that of schedule B, so schedule B is retained and schedule A is discarded.
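The decoding and keep-or-discard step in this example can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `select_schedule` and the mapping table are hypothetical, the one-hot code 010 is assumed for the tie case, and a tie is resolved by keeping schedule A, which is only one of the choices the embodiment allows.

```python
from typing import Tuple

# Hypothetical decoding table for the three one-hot codes used in the example.
CODE_TO_RELATION = {
    "001": "A<B",   # schedule A runs for a shorter time than schedule B
    "010": "A==B",  # both schedules run for the same time (assumed tie code)
    "100": "A>B",   # schedule A runs for a longer time than schedule B
}


def select_schedule(code: str, a: str, b: str) -> Tuple[str, str]:
    """Return (kept, discarded) for two schedules given the encoded result."""
    relation = CODE_TO_RELATION[code]  # raises KeyError for an unknown code
    if relation == "A>B":
        return b, a
    # A is faster, or the two tie; on a tie this sketch arbitrarily keeps A.
    return a, b


kept, discarded = select_schedule("100", "schedule A", "schedule B")
```

Keeping the decoding table separate from the selection logic means a different encoding only requires changing the table.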
In summary, the embodiments of this application obtain at least two candidate schedules corresponding to a target computation expression, invoke the cost comparison model on them, and directly compare the relative execution durations of the at least two candidate schedules on the target hardware platform, outputting a cost comparison result that indicates the ordering of the execution durations. This enables the automatic tuning function of a compiler or automatic optimizer and greatly improves the speed and accuracy of evaluating the running cost of schedules.
It should be noted that, before the computer device obtains the cost comparison model, the model must be trained on a training sample set. The training process of the cost comparison model is described below.
In one possible implementation, as shown in FIG. 6, the training process of the cost comparison model includes the following steps:
Step 601: Obtain a training sample set. The training sample set includes at least one sample data group.
The cost comparison model is trained on the at least one sample data group. Each sample data group includes at least two sample schedules corresponding to a sample computation expression and a correct cost comparison result labeled in advance.
Step 602: For each sample data group, preprocess the at least two sample schedules to obtain at least two preprocessed sample schedules.
For each sample data group, the computer device performs feature extraction on each of the at least two sample schedules to obtain a feature matrix, and normalizes the feature matrix corresponding to the sample schedule to obtain the preprocessed sample schedule.
For example, feature extraction is the process of extracting features from a sample schedule and converting them into structured data.
It should be noted that the feature matrix is described in detail in the embodiments below and is not introduced here.
Step 603: Input the at least two preprocessed sample schedules into an original parameter model to obtain a training result. The original parameter model is a neural network model.
Optionally, the original parameter model is built on a neural network model; for example, it is built on a DNN model.
For example, for each sample data group, the computer device creates an input-output pair corresponding to that group. The input parameter of the input-output pair is the at least two sample schedules in the group, and the target parameter is the correct cost comparison result in the group. The computer device inputs the input parameter into the original parameter model to obtain the training result.
Optionally, the input-output pair is represented by feature vectors.
Step 604: Compare the training result with the correct cost comparison result to obtain a computed loss. The computed loss indicates the error between the training result and the correct cost comparison result.
Optionally, the training result is the encoded information output by the original parameter model, and the correct cost comparison result is encoded information labeled in advance. For example, the encoded information uses one-hot encoding.
Optionally, the computed loss is expressed as a cross entropy.
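As an illustration of a cross-entropy loss against a one-hot target, the following sketch computes the loss for a single training result. It assumes, beyond what the text states, that the model emits three raw scores that are turned into probabilities with a softmax; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (max is subtracted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores: Sequence[float], one_hot: Sequence[int]) -> float:
    """Cross entropy between softmax(scores) and a one-hot target."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


# The target [0, 0, 1] stands for the one-hot code 001.
loss_good = cross_entropy([0.1, 0.2, 3.0], [0, 0, 1])
loss_bad = cross_entropy([3.0, 0.2, 0.1], [0, 0, 1])
```

The loss is small when the highest score sits on the labeled position and grows as probability mass moves to the other positions.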
Step 605: Based on the computed loss corresponding to each of the at least one sample data group, train with the error back-propagation algorithm to obtain the cost comparison model.
Optionally, the computer device uses the back-propagation algorithm to determine the gradient direction of the cost comparison model from the computed loss, and updates the model parameters of the cost comparison model layer by layer, starting from the output layer and moving toward the input.
In an illustrative example, shown in FIG. 7, the at least two candidate schedules are schedule A and schedule B. The computer device draws two schedules, schedule A and schedule B, from the schedule space as input data for training the cost comparison model. The relative execution durations of the two schedules on the target hardware platform are compared and one-hot encoded, producing the encoded information for the input (A, B), that is, the correct cost comparison result, which serves as the target parameter of the back-propagation algorithm. The encoded information is shown in Table 1. When the encoded information is the first value, it indicates that the execution duration of schedule A is shorter than that of schedule B; when it is the second value, it indicates that the two execution durations are equal; and when it is the third value, it indicates that the execution duration of schedule A is longer than that of schedule B.
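Table 1

Comparison of execution durations | Encoded information
Schedule A shorter than schedule B | 001
Schedule A equal to schedule B | 010
Schedule A longer than schedule B | 100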
The computer device performs feature extraction on schedule A and schedule B to obtain their respective feature matrices. For example, the feature matrices of schedule A and schedule B are two 250x57 matrices. Some columns of each feature matrix are normalized to limit their dynamic range. The formula of the normalization function is as follows:
In the formula, v is the input data and v* is the output data. The input-output curve of the normalization function is illustrated in FIG. 8, where the horizontal axis is the input data and the vertical axis is the output data.
The computer device inputs the normalized schedule A into feature embedding module A (DNN_A), which is built from a multilayer perceptron, and obtains a 1x512 schedule embedding A. It likewise inputs the normalized schedule B into feature embedding module B (DNN_B), also built from a multilayer perceptron, and obtains a 1x512 schedule embedding B. The two schedule embeddings are subtracted element by element, that is, schedule embedding B is subtracted from schedule embedding A, to obtain the schedule difference embedding. The schedule difference embedding is input into the deep network discrimination module (DNN_CLS), which outputs the training result, namely encoded information consisting of three numbers. Based on the output of the deep network discrimination module and the true label of schedule A and schedule B, that is, the correct cost comparison result, the computed loss of the model for the current input is calculated using the mean squared error loss function (or least squares function) as the loss function. The computed loss is back-propagated, and the model parameters of the neural network modules DNN_A, DNN_B, and DNN_CLS are updated by gradient descent. These steps are repeated, training on the training sample set for multiple epochs (for example, 30 epochs) until the model converges. The network structures of DNN_A, DNN_B, and DNN_CLS may be end-to-end stacked multilayer perceptron architectures, as shown in FIG. 9, where the numbers denote the number of neurons in each layer and the ReLU function is used as the activation function between fully connected layers.
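The forward pass through DNN_A, DNN_B, and DNN_CLS can be sketched as follows. This is a toy illustration under several assumptions that are not taken from the text: layers have no bias terms, the weights are random, each feature matrix is treated as an already flattened vector, and the layer sizes are tiny placeholders rather than the sizes of FIG. 9 (the text uses 250x57 inputs and 1x512 embeddings). Training is not shown.

```python
import random
from typing import List

Vector = List[float]
Layer = List[List[float]]  # one row of weights per output neuron


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Create a bias-free fully connected layer with small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def mlp(layers: List[Layer], x: Vector) -> Vector:
    """Stacked perceptron: ReLU between layers, no activation after the last."""
    for index, weights in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def compare(dnn_a: List[Layer], dnn_b: List[Layer], dnn_cls: List[Layer],
            feat_a: Vector, feat_b: Vector) -> Vector:
    """Embed both schedules, subtract element by element, then classify."""
    emb_a = mlp(dnn_a, feat_a)
    emb_b = mlp(dnn_b, feat_b)
    diff = [a - b for a, b in zip(emb_a, emb_b)]  # schedule difference embedding
    return mlp(dnn_cls, diff)  # three scores, one per comparison outcome


rng = random.Random(0)
IN_DIM, EMB_DIM = 8, 4  # toy sizes chosen for the sketch
dnn_a = [make_layer(IN_DIM, 16, rng), make_layer(16, EMB_DIM, rng)]
dnn_b = [make_layer(IN_DIM, 16, rng), make_layer(16, EMB_DIM, rng)]
dnn_cls = [make_layer(EMB_DIM, 8, rng), make_layer(8, 3, rng)]
scores = compare(dnn_a, dnn_b, dnn_cls, [0.5] * IN_DIM, [0.2] * IN_DIM)
```

Because the classifier only sees the difference of the two embeddings, its input is the zero vector whenever both embeddings coincide.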
Based on the trained cost comparison model described above, refer to FIG. 10, which shows a flowchart of a method for comparing schedule running times of an operator according to another exemplary embodiment of this application. This embodiment is described using the example in which the method is applied to the computer device shown in FIG. 3. The method includes the following steps:
Step 1001: Obtain at least two candidate schedules from the schedule space corresponding to the target computation expression.
The target computation expression describes the computation logic of an operator, and a candidate schedule is executable code of the operator on the target hardware platform, generated from the target computation expression.
Optionally, the computer device obtains the input target computation expression, analyzes it, generates templates in a preset manner, and determines the schedule space, which includes at least two candidate schedules generated by instantiating the templates. The computer device obtains the at least two candidate schedules from the schedule space corresponding to the target computation expression.
For example, the schedule space includes n candidate schedules. In one possible implementation, the candidates are compared two at a time and the better one is kept after each comparison, so that the optimal target schedule is obtained after n-1 comparisons. In another possible implementation, a bisection-style comparison is used. For example, with n equal to 8, the 8 schedules are divided into 4 pairs, and the cost comparison model selects the faster candidate schedule from each pair, giving 4 candidate schedules. These are regrouped into 2 pairs, which requires 2 comparisons. The 2 best candidate schedules that remain are then compared a final time, yielding the optimal target schedule among the 8 candidate schedules. The embodiments of this application do not limit how schedules are grouped for comparison.
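Both selection strategies can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, and the `faster` callback, which stands in for a call to the cost comparison model, is implemented here with made-up durations so that the example runs on its own.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")


def select_sequential(cands: Sequence[S], faster: Callable[[S, S], S]) -> S:
    """Keep the current best and compare it with each remaining candidate."""
    best = cands[0]
    for cand in cands[1:]:
        best = faster(best, cand)
    return best


def select_bracket(cands: Sequence[S], faster: Callable[[S, S], S]) -> S:
    """Compare candidates in pairs, round by round, until one remains."""
    current = list(cands)
    while len(current) > 1:
        winners = [faster(current[i], current[i + 1])
                   for i in range(0, len(current) - 1, 2)]
        if len(current) % 2:  # an unpaired candidate advances unchanged
            winners.append(current[-1])
        current = winners
    return current[0]


# Made-up durations used only to stand in for the model's verdict.
DURATIONS = {"s0": 5.0, "s1": 4.0, "s2": 9.0, "s3": 1.5,
             "s4": 7.0, "s5": 3.0, "s6": 8.0, "s7": 2.0}
calls: List[Tuple[str, str]] = []


def faster(a: str, b: str) -> str:
    calls.append((a, b))
    return a if DURATIONS[a] <= DURATIONS[b] else b


best_sequential = select_sequential(list(DURATIONS), faster)
best_bracket = select_bracket(list(DURATIONS), faster)
```

For 8 candidates the bracket needs 4 + 2 + 1 = 7 comparisons, the same n-1 total as the sequential scan; the practical difference is that the comparisons within one bracket round are independent of each other.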
Step 1002: For each of the at least two candidate schedules, perform feature extraction on the candidate schedule to obtain a feature matrix.
Optionally, for each of the at least two candidate schedules, the computer device extracts multiple types of information from each of the m loops of the candidate schedule and combines them into vectors, which make up the feature matrix corresponding to the candidate schedule, where m is a positive integer. For example, each combined vector has size 1x57. Information for up to 250 loops is supported, and the vectors are finally assembled into a two-dimensional feature matrix of size 250x57. The number of supported loops may vary according to actual needs, which is not limited in the embodiments of this application.
Optionally, the feature matrix indicates at least one of loop information, input data shape information, computation encoding, axis type encoding, and data access type encoding.
The loop information includes information related to the loop computation logic of the candidate schedule. Optionally, the loop information is the loop information of one level in the schedule; for example, its size is 1x6. The loop information includes at least one of the following: the loop depth, the nesting level, the block number, a flag indicating whether this is the last loop, the quotient of the loop depth, and the remainder of the loop depth. The loop depth and the quotient of the loop depth need to be normalized.
The input data shape information describes the input data of the operator. For example, its size is 1x10. The operator is a single-input, dual-input, or multi-input operator. The input data shape information includes the shape information of each of k inputs, where k is a positive integer, and the shape information includes at least one of the batch size, the number of channels, the height, the width, and the minimum number of channels.
The computation encoding includes the encoding of the computation instructions used in the current loop of the candidate schedule. For example, its size is 1x6. The computation encoding includes at least one of the following: the memory access type, the program instruction, the data type, the storage unit, and a flag indicating whether double buffering is used.
The axis type encoding includes the encoding of the type of operation performed on an axis. For example, its size is 1x15. The axis type encoding indicates at least one of the operations of extending an axis and normalizing an axis.
The data access type encoding includes the encoding of the type of access to data. For example, its size is 1x20. The data access type encoding indicates at least one of the following accesses: write data, read data, allocate, and pragma.
In an illustrative example, feature extraction is performed on a candidate schedule to obtain a feature matrix whose data structure is shown in FIG. 11. Multiple types of information are extracted from each loop of the candidate schedule and combined into a vector of size 1x57. Information for up to 250 loops is supported, and the vectors are finally assembled into a two-dimensional feature matrix of size 250x57. The feature matrix indicates the loop information, the input data shape information, the computation encoding, the axis type encoding, and the data access type encoding. The loop information has size 1x6, the input data shape information has size 1x10, the computation encoding has size 1x6, the axis type encoding has size 1x15, and the data access type encoding has size 1x20.
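The row layout in this example (6 + 10 + 6 + 15 + 20 = 57 columns per loop, at most 250 loops) can be sketched as follows. The field names and helper functions are hypothetical, and filling unused rows with zeros is an assumption: the text does not say how schedules with fewer than 250 loops are represented.

```python
from typing import Dict, List

# Column widths taken from the example; the field names are placeholders.
FIELDS = [
    ("loop_info", 6),
    ("input_shape", 10),
    ("compute_code", 6),
    ("axis_type", 15),
    ("data_access", 20),
]
ROW_WIDTH = sum(width for _, width in FIELDS)  # 57
MAX_LOOPS = 250


def loop_row(fields: Dict[str, List[float]]) -> List[float]:
    """Concatenate the five per-loop fields into one 1x57 row."""
    row: List[float] = []
    for name, width in FIELDS:
        values = fields[name]
        if len(values) != width:
            raise ValueError(f"{name} needs {width} values, got {len(values)}")
        row.extend(values)
    return row


def feature_matrix(loops: List[Dict[str, List[float]]]) -> List[List[float]]:
    """Stack the per-loop rows and pad to 250 rows (zero padding is assumed)."""
    if len(loops) > MAX_LOOPS:
        raise ValueError(f"at most {MAX_LOOPS} loops are supported")
    rows = [loop_row(loop) for loop in loops]
    rows += [[0.0] * ROW_WIDTH for _ in range(MAX_LOOPS - len(rows))]
    return rows


example_loop = {name: [1.0] * width for name, width in FIELDS}
matrix = feature_matrix([example_loop, example_loop])
```

Padding to a fixed height gives every schedule the same input shape regardless of how many loops it contains.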
It should be noted that, in addition to the feature extraction, mapping method, and data structure provided in the embodiments of this application, other ways of expressing a schedule can also be used as input to the cost comparison model. The embodiments of this application do not limit the input data structure.
Step 1003: For each of the at least two candidate schedules, normalize the feature matrix corresponding to the candidate schedule to obtain a preprocessed candidate schedule.
Step 1004: Input the at least two preprocessed candidate schedules into the trained cost comparison model and obtain the cost comparison result as output. The cost comparison result indicates the ordering of the execution durations of the at least two candidate schedules on the target hardware platform.
Optionally, the computer device obtains the trained cost comparison model, which is a model obtained by training a neural network with multiple sample schedules. The computer device inputs the at least two preprocessed candidate schedules into the trained cost comparison model and obtains the cost comparison result as output. The cost comparison result indicates the ordering of the execution durations of the at least two candidate schedules on the target hardware platform.
For the process by which the computer device invokes the cost comparison model, refer to the relevant details in the embodiments above, which are not repeated here.
Optionally, the computer device adds the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set, and trains the cost comparison model on the updated training sample set to obtain an updated cost comparison model.
In an illustrative example, shown in FIG. 12, the at least two candidate schedules are schedule A and schedule B. The computer device draws schedule A and schedule B from the schedule space and performs feature extraction on them to obtain their respective feature matrices. For example, the feature matrices of schedule A and schedule B are two 250x57 matrices. Some columns of each feature matrix are normalized to limit their dynamic range; the normalization is analogous to that described for the model training process above and is not repeated here. The computer device inputs the normalized schedule A into feature embedding module A (DNN_A), which is built from a multilayer perceptron, and obtains a 1x512 schedule embedding A. It likewise inputs the normalized schedule B into feature embedding module B (DNN_B), also built from a multilayer perceptron, and obtains a 1x512 schedule embedding B. The two schedule embeddings are subtracted element by element, that is, schedule embedding B is subtracted from schedule embedding A, to obtain the schedule difference embedding. The schedule difference embedding is input into the deep network discrimination module (DNN_CLS), which outputs the cost comparison result, namely an encoding result consisting of three numbers. The network structures of DNN_A, DNN_B, and DNN_CLS are analogous to those described for the model training process above and are not repeated here. The computer device converts the three output numbers into the one-hot label format.
In summary, in this embodiment of the present application, feature extraction is performed on the at least two candidate schedules, mapping each schedule to its unique matrix representation, so as to obtain the feature matrix representations of the at least two candidate schedules; the feature matrix representations are normalized; the cost comparison model, which is based on a deep neural network, takes the at least two preprocessed feature matrix representations as input and outputs encoded information representing the predicted comparison result of the execution durations of the at least two candidate schedules; and the encoded information output by the cost comparison model is decoded to obtain the comparison result of the execution durations of the at least two candidate schedules. In other words, a deep learning network model compares the execution durations, on a specific hardware platform, of different schedule implementations of the same calculation expression. This replaces the process of compiling each schedule implementation and then running and measuring it on hardware, and addresses the slow large-scale search of automatic operator optimization systems such as automatic optimizers and compilers.
In an illustrative example, the cost comparison model is implemented with the goal of predicting which schedule of an operator executes faster. The training sample set includes 20792 schedules derived from 32 operators, and the number of schedules differs from operator to operator. Schedules belonging to the same operator are paired with one another to form the training instance set; after pairing, the execution durations of the two schedules are compared, and the target of each paired training instance is generated by the method described above. For example, schedule A and schedule B are drawn; the actual execution duration of schedule A is 15 seconds and that of schedule B is 17 seconds, so (A, B) is a training instance, and because 15 seconds is less than 17 seconds, the target encoding of the training instance (A, B) is 001. When schedules belonging to the same operator are paired, the training sample set may include the pairing of a schedule with itself, and the target encoding of such a training instance is 010. Pairings of schedules belonging to the same operator are order-sensitive: the pair (A, B) differs from the pair (B, A), and if the execution durations of A and B differ, the target encodings of (A, B) and (B, A) also differ. If an operator contains N (N>2) schedules, pairing yields N squared training instances. With this pairing scheme, a relatively large training data set can be built even when the amount of training data is limited. In this example, the 20792 schedules form 49 million training instances and their target encodings, which are used to train the model. The model structure is as described above and is not repeated here. The neural network model is trained in batches, with 5000 training instances input per iteration and the learning rate set to 10e-8; stochastic gradient descent with momentum is used to train on the complete training instance set for multiple epochs (for example, 30 epochs). The test set includes 46022 test instances, each consisting of two schedules belonging to the same operator, and no schedule used to generate test instances is included in the set of schedules used to generate training instances. The test target encoding of each test instance is generated by the method described above. If the prediction output by the network, after passing through the argmax function, exactly matches the test target encoding, the test instance is recorded as correctly predicted by the network. Accuracy is defined as the number of test instances correctly predicted by the network divided by the total number of test instances tested. Of the 46022 test instances, the method correctly predicts 41242, an accuracy of 89.61%. The accuracy of the model can be further improved by increasing the number of training schedules and by optimizing the network structure.
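The pairing scheme and the accuracy metric of this example can be sketched as follows. The function and variable names are hypothetical. The encodings 001 (first schedule faster) and 010 (equal durations) come from the text; the encoding 100 for the remaining case (first schedule slower) is an assumption, since this passage does not state it.

```python
from itertools import product


def target_code(time_first, time_second):
    """One-hot target for an ordered pair (first, second)."""
    if time_first < time_second:
        return (0, 0, 1)      # first is faster: 001 (from the text)
    if time_first == time_second:
        return (0, 1, 0)      # equal durations: 010 (from the text)
    return (1, 0, 0)          # first is slower: 100 (assumed)


def build_instances(schedules_by_operator):
    """Pair schedules of the same operator, including self-pairs.

    schedules_by_operator maps operator -> {schedule name: measured seconds}.
    Pairs are ordered, so an operator with N schedules yields N * N instances.
    """
    instances = []
    for operator, timings in schedules_by_operator.items():
        for (name_a, t_a), (name_b, t_b) in product(timings.items(), repeat=2):
            instances.append(((operator, name_a, name_b), target_code(t_a, t_b)))
    return instances


def accuracy(predicted, expected):
    """Fraction of instances whose predicted label matches the target exactly."""
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


pairs = build_instances({"op": {"A": 15.0, "B": 17.0}})
for key, code in pairs:
    print(key, code)

# The reported figure: 41242 correct out of 46022 test instances.
print(round(41242 / 46022 * 100, 2))
```

Ordered pairing is what makes (A, B) and (B, A) distinct instances with different targets whenever the two durations differ.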
In summary, the embodiments of the present application provide a method for comparing the scheduled running times of an operator, which uses the idea of cost comparison to determine the comparison result of the relative execution durations of at least two schedules and applies the cost comparison model to the tuning process of an automatic operator optimization system. The embodiments also relate to a modeling method for a cost comparison model applicable to an automatic operator optimization system, covering the architecture design of the model, model training, and model inference. During model training and model inference, a schedule is converted into a dedicated data structure through feature extraction, the data is normalized, and the output is expressed in a defined format. The approach offers high accuracy, fast inference, and a lower training cost than existing methods. That is, first, the cost comparison model maintains a high accuracy; second, the inference speed of the cost comparison model is improved, with the comparison of one set of instances taking only 3 milliseconds; third, the cost comparison model needs relatively little training data and computing power, with 30 epochs of training on more than 49 million training instances completed in 70 hours on a single GPU card. With the cost comparison model, automatic tuning by a code optimizer or compiler only needs to consider how to improve the accuracy of the cost comparison model. By contrast, a cost model in the related art that predicts the absolute execution duration of a schedule must consider, in addition to prediction accuracy, how to handle the boundary problem caused by error: for example, when the difference between the predicted running times of two schedules is smaller than the prediction error of the model, the absolute-value model cannot give a high-confidence prediction.
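The boundary problem of an absolute-duration cost model can be illustrated with a small sketch. The function name, the single scalar `model_error`, and the sample numbers are all hypothetical; the sketch only restates the condition in the text, namely that an ordering derived from two absolute predictions is not trustworthy when their difference is smaller than the prediction error.

```python
def absolute_model_verdict(predicted_a, predicted_b, model_error):
    """Order two schedules from absolute predictions, if the gap allows it.

    predicted_a, predicted_b: predicted execution durations in seconds.
    model_error: assumed prediction error of the absolute model, in seconds.
    """
    if abs(predicted_a - predicted_b) < model_error:
        # The gap is inside the error band, so either ordering is plausible.
        return "uncertain"
    return "A faster" if predicted_a < predicted_b else "B faster"


print(absolute_model_verdict(15.0, 17.0, model_error=3.0))
print(absolute_model_verdict(15.0, 17.0, model_error=0.5))
```

A comparison model sidesteps this case because it is trained to output the ordering directly rather than two durations that must then be compared.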
The following are apparatus embodiments of the present application, which can be used to carry out the method embodiments of the present application. For details not disclosed in the apparatus embodiments, refer to the method embodiments of the present application.
Refer to FIG. 13, which shows a block diagram of an apparatus for comparing the scheduled running times of an operator according to an exemplary embodiment of the present application. The apparatus may be implemented, through software, hardware, or a combination of the two, as all or part of the computer device shown in FIG. 3. The apparatus may include a first obtaining unit 1310, a second obtaining unit 1320, and a calling unit 1330.
The first obtaining unit 1310 is configured to obtain at least two candidate schedules corresponding to a target calculation expression, where the target calculation expression describes the calculation logic of an operator and each candidate schedule is executable code of the operator generated based on the target calculation expression;
The second obtaining unit 1320 is configured to obtain a cost comparison model, where the cost comparison model is a model obtained by training a neural network with multiple sample schedules;
The calling unit 1330 is configured to call the cost comparison model according to the at least two candidate schedules to output a cost comparison result, where the cost comparison result indicates the ordering of the execution durations of the at least two candidate schedules.
In a possible implementation, the calling unit 1330 is further configured to:
preprocess the at least two candidate schedules to obtain the preprocessed at least two candidate schedules;
input the preprocessed at least two candidate schedules into the cost comparison model to output the cost comparison result;
where the cost comparison model is trained on at least one sample data group, and each sample data group includes at least two sample schedules corresponding to a sample calculation expression and a pre-labeled correct cost comparison result.
In another possible implementation, the calling unit 1330 is further configured to:
for each of the at least two candidate schedules, perform feature extraction on the candidate schedule to obtain a feature matrix;
normalize the feature matrix corresponding to the candidate schedule to obtain the preprocessed candidate schedule.
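The normalization step can be sketched as below. This passage does not state the normalization formula or which columns it applies to, so both are assumptions here: the sketch applies log(1 + x) to a caller-chosen set of column indices, which is one common way to limit the dynamic range of large, non-negative values such as loop extents. The function name is hypothetical.

```python
import math


def normalize_columns(matrix, columns):
    """Return a copy of matrix with log(1 + x) applied to selected columns.

    matrix: feature matrix as a list of rows (lists of non-negative floats).
    columns: indices of the columns whose dynamic range should be limited.
    """
    selected = set(columns)
    return [
        [math.log1p(value) if j in selected else value
         for j, value in enumerate(row)]
        for row in matrix
    ]


features = [[0.0, 1024.0], [3.0, 7.0]]
print(normalize_columns(features, columns=[1]))
```

The input matrix is left unchanged, so the raw features remain available if a different normalization is tried later.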
In another possible implementation, the feature matrix indicates at least one of loop information, input data shape information, a calculation encoding, an axis type encoding, and a data access type encoding. The loop information includes information related to the loop calculation logic of the candidate schedule; the input data shape information describes the input data of the operator; the calculation encoding includes an encoding of the calculation instructions used in the current loop of the candidate schedule; the axis type encoding includes an encoding of the type of operation performed on an axis; and the data access type encoding includes an encoding of the type of access to data.
In another possible implementation, the apparatus further includes a training unit, and the training unit is configured to:
obtain a training sample set, where the training sample set includes at least one sample data group;
for each sample data group, preprocess the at least two sample schedules to obtain the preprocessed at least two sample schedules;
input the preprocessed at least two sample schedules into an original parameter model to obtain a training result, where the original parameter model is a neural network model;
compare the training result with the correct cost comparison result to obtain a calculated loss, where the calculated loss indicates the error between the training result and the correct cost comparison result;
train, using an error backpropagation algorithm and according to the calculated loss corresponding to each of the at least one sample data group, to obtain the cost comparison model.
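The loss computation and parameter update in the steps above can be sketched as follows. Several things here are assumptions not taken from the text: the loss is taken to be softmax cross-entropy, the model is reduced to a single linear three-way classifier, and the update is plain gradient descent rather than the momentum variant mentioned in the worked example. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, target):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p) for p, t in zip(probabilities, target) if t)


def train_step(weights, features, target, learning_rate):
    """Run one gradient step in place; return the loss before the update."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probabilities = softmax(scores)
    loss = cross_entropy(probabilities, target)
    # For softmax cross-entropy, d(loss)/d(score_k) = probability_k - target_k.
    for k, row in enumerate(weights):
        gradient = probabilities[k] - target[k]
        for j, x in enumerate(features):
            row[j] -= learning_rate * gradient * x
    return loss


weights = [[0.0, 0.0] for _ in range(3)]   # 3 classes, 2 input features
sample = [1.0, -2.0]                       # made-up difference embedding
label = [0, 0, 1]                          # target encoding 001
for step in range(3):
    print(step, train_step(weights, sample, label, learning_rate=0.1))
```

In a multilayer network the same output-layer gradient would be propagated backward through the earlier layers, which is the role of the error backpropagation algorithm named in the steps.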
In another possible implementation, the apparatus further includes an update unit, and the update unit is configured to:
add the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set;
train the cost comparison model according to the updated training sample set to obtain an updated cost comparison model.
It should be noted that the division into the functional modules above is only an example of how the apparatus provided in the foregoing embodiments implements its functions. In practice, the functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to perform all or some of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above are based on the same concept; for the specific implementation process, see the method embodiments, which is not repeated here.
An embodiment of the present application provides an apparatus for comparing the scheduled running times of an operator. The apparatus includes a processor and a memory for storing instructions executable by the processor, where the processor is configured to implement, when executing the instructions, the method performed by the computer device in the foregoing embodiments.
An embodiment of the present application provides a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code. When the computer-readable code runs in a processor, the processor performs the method performed by the computer device in the foregoing embodiments.
An embodiment of the present application provides a non-volatile computer-readable storage medium on which computer program instructions are stored. When executed by a processor, the computer program instructions implement the method performed by the computer device in the foregoing embodiments.
A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. A computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random-access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the foregoing.
The computer-readable program instructions or code described herein may be downloaded from a computer-readable storage medium to the respective computing/processing devices, or downloaded to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out the operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), is personalized using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions, thereby implementing the various aspects of the present application.
Aspects of the present application are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present application. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to operate in a particular manner, so that the computer-readable medium storing the instructions constitutes an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps is performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings show the architecture, functions, and operations of possible implementations of apparatuses, systems, methods, and computer program products according to multiple embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of an instruction, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved.
It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by hardware that performs the corresponding functions or acts (for example, a circuit or an application-specific integrated circuit (ASIC)), or by a combination of hardware and software, such as firmware.
Although the present application has been described herein with reference to various embodiments, in practicing the claimed application, those skilled in the art can understand and effect other variations of the disclosed embodiments by studying the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill several functions recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that these measures cannot be combined to good effect.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
- 一种算子的调度运行时间比较方法,其特征在于,所述方法包括:A scheduling runtime comparison method of an operator, characterized in that the method comprises:获取目标计算表达对应的至少两个候选调度,所述目标计算表达用于描述算子的计算逻辑,所述候选调度为基于所述目标计算表达生成的所述算子在目标硬件平台上的可执行代码;Obtain at least two candidate schedules corresponding to the target computing expression, the target computing expression is used to describe the computing logic of the operator, and the candidate scheduling is the available schedule of the operator on the target hardware platform generated based on the target computing expression execute code;获取代价比较模型,所述代价比较模型为采用多个样本调度对神经网络进行训练得到的模型;Obtaining a cost comparison model, where the cost comparison model is a model obtained by training a neural network using multiple sample scheduling;根据所述至少两个候选调度,调用代价比较模型输出得到代价对比结果,所述代价对比结果用于指示所述至少两个候选调度在所述目标硬件平台上的执行时长的大小排序。According to the at least two candidate schedules, the output of the cost comparison model is invoked to obtain a cost comparison result, and the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules on the target hardware platform.
- 根据权利要求1所述的方法,其特征在于,所述根据所述至少两个候选调度,调用代价比较模型输出得到代价对比结果,包括:The method according to claim 1, wherein, according to the at least two candidate schedules, calling the cost comparison model to output the cost comparison result includes:对所述至少两个候选调度进行预处理,得到预处理后的所述至少两个候选调度;Preprocessing the at least two candidate schedules to obtain the preprocessed at least two candidate schedules;将预处理后的所述至少两个候选调度输入至所述代价比较模型中,输出得到所述代价对比结果;inputting the preprocessed at least two candidate schedules into the cost comparison model, and outputting the cost comparison result;其中,所述代价比较模型是根据至少一组样本数据组训练得到的,每组所述样本数据组包括:样本计算表达对应的至少两个样本调度和预先标注的正确代价对比结果。Wherein, the cost comparison model is trained according to at least one set of sample data sets, and each set of sample data sets includes: at least two sample schedules corresponding to sample calculation expressions and pre-marked correct cost comparison results.
- 根据权利要求2所述的方法,其特征在于,所述对所述至少两个候选调度进行预处理,得到预处理后的所述至少两个候选调度,包括:The method according to claim 2, wherein the preprocessing the at least two candidate schedules to obtain the preprocessed at least two candidate schedules includes:对于所述至少两个候选调度中的每个所述候选调度,对所述候选调度进行特征提取得到特征矩阵;For each of the candidate schedules in the at least two candidate schedules, performing feature extraction on the candidate schedules to obtain a feature matrix;对所述候选调度对应的所述特征矩阵进行归一化处理,得到预处理后的所述候选调度。Perform normalization processing on the feature matrix corresponding to the candidate schedule to obtain the preprocessed candidate schedule.
- 根据权利要求3所述的方法,其特征在于,所述特征矩阵用于指示循环信息、输入数据形状信息、计算编码、轴类型编码和数据访问类型编码中的至少一种,所述循环信息包括与所述候选调度的循环计算逻辑相关的信息,所述输入数据形状信息用于描述所述算子的输入数据,所述计算编码包括所述候选调度的当前循环里用到的计算指令的编码,所述轴类型编码包括对轴进行操作的类型编码,所述数据访问类型编码包括对数据进行访问的类型编码。The method according to claim 3, wherein the characteristic matrix is used to indicate at least one of cycle information, input data shape information, calculation code, axis type code and data access type code, and the cycle information includes Information related to the cycle calculation logic of the candidate schedule, the input data shape information is used to describe the input data of the operator, and the calculation code includes the code of the calculation instruction used in the current cycle of the candidate schedule , the axis type encoding includes an axis type encoding, and the data access type encoding includes a data access type encoding.
- 根据权利要求2至4任一所述的方法,其特征在于,所述获取代价比较模型之前还包括:The method according to any one of claims 2 to 4, wherein said acquisition of the cost comparison model also includes:获取训练样本集,所述训练样本集包括至少一组所述样本数据组;Obtain a training sample set, the training sample set includes at least one set of the sample data set;对于每组所述样本数据组,对至少两个样本调度进行预处理得到预处理后的所述至少两个样本调度;For each set of sample data sets, preprocessing at least two sample schedules is performed to obtain the preprocessed at least two sample schedules;将预处理后的所述至少两个样本调度输入原始参数模型得到训练结果,所述原始参数模型为神经网络模型;inputting the preprocessed at least two sample schedules into an original parameter model to obtain a training result, and the original parameter model is a neural network model;将所述训练结果与所述正确代价对比结果进行比较,得到计算损失,所述计算损失用于指示所述训练结果与所述正确代价对比结果之间的误差;comparing the training result with the correct cost comparison result to obtain a calculation loss, the calculation loss being used to indicate an error between the training result and the correct cost comparison result;根据所述至少一组样本数据组各自对应的计算损失,采用误差反向传播算法训练得到所述代价比较模型。According to the calculated losses corresponding to the at least one set of sample data groups, the cost comparison model is obtained through training with an error backpropagation algorithm.
- 根据权利要求2至5任一所述的方法,其特征在于,所述根据所述至少两个候选调度,调用代价比较模型输出得到代价对比结果之后,还包括:The method according to any one of claims 2 to 5, characterized in that, after calling the cost comparison model to output the cost comparison result according to the at least two candidate schedules, further comprising:将所述至少两个候选调度和所述代价对比结果添加至所述训练样本集,得到更新后的训练样本集;adding the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set;根据所述更新后的训练样本集对所述代价比较模型进行训练,得到更新后的代价比较模型。The cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model.
- 一种算子的调度运行时间比较装置,其特征在于,所述装置包括:An operator scheduling running time comparison device, characterized in that the device includes:第一获取单元,用于获取目标计算表达对应的至少两个候选调度,所述目标计算表达用于描述算子的计算逻辑,所述候选调度为基于所述目标计算表达生成的所述算子的可执行代码;A first acquisition unit, configured to acquire at least two candidate schedules corresponding to a target calculation expression, the target calculation expression is used to describe the calculation logic of an operator, and the candidate schedule is the operator generated based on the target calculation expression the executable code;第二获取单元,用于获取代价比较模型,所述代价比较模型为采用多个样本调度对神经网络进行训练得到的模型;The second acquisition unit is used to acquire a cost comparison model, and the cost comparison model is a model obtained by training a neural network by adopting multiple sample scheduling;调用单元,用于根据所述至少两个候选调度,调用代价比较模型输出得到代价对比结果,所述代价对比结果用于指示所述至少两个候选调度的执行时长的大小排序。The calling unit is configured to call the output of the cost comparison model according to the at least two candidate schedules to obtain a cost comparison result, and the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules.
- An operator scheduling running time comparison device, characterized in that the device comprises: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the method according to any one of claims 1 to 6 when executing the instructions.
- A non-volatile computer-readable storage medium storing computer program instructions, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 6.
- A computer program product, characterized in that, when the computer program product runs on a computer, the computer performs the method according to any one of claims 1 to 6.
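The training and retraining steps recited in the claims above can be illustrated with a minimal sketch. It is not the applicant's implementation: a single logistic unit acting on the feature difference of two schedules stands in for the neural network, and its gradient update is the one-layer case of error backpropagation. The names `predict`, `train`, and `true_cost`, the two-element feature vectors, and the synthetic cost function are all assumptions made for illustration.

```python
import math
import random

def predict(weights, feat_a, feat_b):
    """Estimated probability that schedule A runs longer than schedule B."""
    z = sum(w * (a - b) for w, a, b in zip(weights, feat_a, feat_b))
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp within range
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, weights, epochs=200, lr=0.1):
    """Fit weights on (feat_a, feat_b, label) groups; label is 1.0 if A is slower."""
    weights = list(weights)
    for _ in range(epochs):
        for feat_a, feat_b, label in samples:
            # Calculation loss: cross-entropy between the training result
            # and the correct comparison; its gradient w.r.t. z is (p - label).
            err = predict(weights, feat_a, feat_b) - label
            # Propagate the error back to each weight.
            for i in range(len(weights)):
                weights[i] -= lr * err * (feat_a[i] - feat_b[i])
    return weights

def true_cost(feat):
    """Hypothetical ground-truth running time, used only to label samples."""
    return 2.0 * feat[0] + 1.0 * feat[1]

# Build a training sample set of schedule pairs with correct comparison labels.
rng = random.Random(0)
samples = []
for _ in range(50):
    a = [rng.random(), rng.random()]
    b = [rng.random(), rng.random()]
    samples.append((a, b, 1.0 if true_cost(a) > true_cost(b) else 0.0))

weights = train(samples, [0.0, 0.0])

# Compare two candidate schedules, then add the pair and the model's
# comparison result to the sample set and retrain (the update step).
cand_a, cand_b = [0.9, 0.9], [0.1, 0.1]
p = predict(weights, cand_a, cand_b)
result = 1.0 if p > 0.5 else 0.0
samples.append((cand_a, cand_b, result))
weights = train(samples, weights, epochs=20)
```

Because the model scores the difference of the two feature vectors, swapping the inputs flips the prediction, so the comparison stays consistent regardless of the order in which the candidate schedules are presented.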
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202280006829.5A CN116897356A (en) | 2022-02-08 | 2022-02-08 | Operator scheduling run time comparison method, device and storage medium |
PCT/CN2022/075526 WO2023150912A1 (en) | 2022-02-08 | 2022-02-08 | Operator scheduling operation time comparison method and device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/075526 WO2023150912A1 (en) | 2022-02-08 | 2022-02-08 | Operator scheduling operation time comparison method and device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023150912A1 (en) | 2023-08-17 |
Family
ID=87563395
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/075526 WO2023150912A1 (en) | 2022-02-08 | 2022-02-08 | Operator scheduling operation time comparison method and device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116897356A (en) |
WO (1) | WO2023150912A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116755779A (en) * | 2023-08-18 | 2023-09-15 | 腾讯科技(深圳)有限公司 | Method, device, equipment, storage medium and chip for determining cycle interval |
CN117032936A (en) * | 2023-09-28 | 2023-11-10 | 之江实验室 | Data scheduling method and device and computer equipment |
CN117171577A (en) * | 2023-11-02 | 2023-12-05 | 之江实验室 | Dynamic decision method and device for high-performance operator selection |
CN118313429A (en) * | 2024-06-13 | 2024-07-09 | 之江实验室 | Model training video memory optimization method and device, electronic device and storage medium |
CN118689753A (en) * | 2024-08-23 | 2024-09-24 | 北京壁仞科技开发有限公司 | Data processing method and device, electronic equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180150783A1 (en) * | 2016-08-24 | 2018-05-31 | Clari Inc. | Method and system for predicting task completion of a time period based on task completion rates and data trend of prior time periods in view of attributes of tasks using machine learning models |
US20180373564A1 (en) * | 2017-06-22 | 2018-12-27 | Banuba Limited | Computer Systems And Computer-Implemented Methods For Dynamically Adaptive Distribution Of Workload Between Central Processing Unit(s) and Graphics Processing Unit(s) |
CN112668701A (en) * | 2020-12-31 | 2021-04-16 | 上海商汤智能科技有限公司 | Neural network operation method and device, electronic equipment and storage medium |
CN113128702A (en) * | 2021-04-15 | 2021-07-16 | 杭州电子科技大学 | Neural network self-adaptive distributed parallel training method based on reinforcement learning |
CN113342631A (en) * | 2021-07-02 | 2021-09-03 | 厦门美图之家科技有限公司 | Distribution management optimization method and device and electronic equipment |
CN113946412A (en) * | 2020-07-17 | 2022-01-18 | 阿里巴巴集团控股有限公司 | Scheduling search method and apparatus, cloud service providing method, electronic device, and computer-readable storage medium |
2022
- 2022-02-08 WO PCT/CN2022/075526 patent/WO2023150912A1/en active Application Filing
- 2022-02-08 CN CN202280006829.5A patent/CN116897356A/en active Pending
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116755779A (en) * | 2023-08-18 | 2023-09-15 | 腾讯科技(深圳)有限公司 | Method, device, equipment, storage medium and chip for determining cycle interval |
CN116755779B (en) * | 2023-08-18 | 2023-12-05 | 腾讯科技(深圳)有限公司 | Method, device, equipment, storage medium and chip for determining cycle interval |
CN117032936A (en) * | 2023-09-28 | 2023-11-10 | 之江实验室 | Data scheduling method and device and computer equipment |
CN117032936B (en) * | 2023-09-28 | 2024-02-06 | 之江实验室 | Data scheduling method and device and computer equipment |
CN117171577A (en) * | 2023-11-02 | 2023-12-05 | 之江实验室 | Dynamic decision method and device for high-performance operator selection |
CN117171577B (en) * | 2023-11-02 | 2024-03-22 | 之江实验室 | Dynamic decision method and device for high-performance operator selection |
CN118313429A (en) * | 2024-06-13 | 2024-07-09 | 之江实验室 | Model training video memory optimization method and device, electronic device and storage medium |
CN118689753A (en) * | 2024-08-23 | 2024-09-24 | 北京壁仞科技开发有限公司 | Data processing method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN116897356A (en) | 2023-10-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023150912A1 (en) | Operator scheduling operation time comparison method and device, and storage medium | |
US20220374719A1 (en) | Application Development Platform and Software Development Kits that Provide Comprehensive Machine Learning Services | |
US12093675B2 (en) | Application development platform and software development kits that provide comprehensive machine learning services | |
US11790212B2 (en) | Quantization-aware neural architecture search | |
US20200265301A1 (en) | Incremental training of machine learning tools | |
CN110852438B (en) | Model generation method and device | |
US20220230048A1 (en) | Neural Architecture Scaling For Hardware Accelerators | |
US20190138887A1 (en) | Systems, methods, and media for gated recurrent neural networks with reduced parameter gating signals and/or memory-cell units | |
WO2023160290A1 (en) | Neural network inference acceleration method, target detection method, device, and storage medium | |
JP2018533153A (en) | Network model construction method and apparatus based on machine learning | |
Jin et al. | Rc-darts: Resource constrained differentiable architecture search | |
US20230196202A1 (en) | System and method for automatic building of learning machines using learning machines | |
CN116594748B (en) | Model customization processing method, device, equipment and medium for task | |
CN112149809A (en) | Model hyper-parameter determination method and device, calculation device and medium | |
US20220076095A1 (en) | Multi-level sparse neural networks with dynamic rerouting | |
KR20200063041A (en) | Method and apparatus for learning a neural network using unsupervised architecture variation and supervised selective error propagation | |
CN116011509A (en) | Hardware-aware machine learning model search mechanism | |
CN116097281A (en) | Theoretical superparameter delivery via infinite width neural networks | |
EP4217928A1 (en) | Neural architecture scaling for hardware accelerators | |
US11704562B1 (en) | Architecture for virtual instructions | |
CN116976461A (en) | Federal learning method, apparatus, device and medium | |
KR20210035702A (en) | Method of artificial neural network quantization and method of computation using artificial neural network | |
CN113762356B (en) | Cluster load prediction method and system based on clustering and attention mechanism | |
Metz et al. | Fast and Accurate: Machine Learning Techniques for Performance Estimation of CNNs for GPGPUs | |
CN112241786A (en) | Model hyper-parameter determination method and device, calculation device and medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | Wipo information: entry into national phase | Ref document number: 202280006829.5; Country of ref document: CN |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22925277; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |